Great advice. Thanks a lot, Nick.

In fact, if we call rdd.persist(StorageLevel.DISK_ONLY) at the beginning of the program to avoid hitting the network again and again, the speed is not affected much. In my case, the job takes just 1 minute longer than when the data is stored in the local HDFS.
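
A minimal PySpark sketch of this pattern (read once from ES through es-hadoop, then keep the result on the Spark cluster's local disks) could look like the following. It is only an illustration under assumptions: the helper names `es_read_conf` and `load_es_rdd_on_disk`, the host `es-host:9200` and the resource `logs/event` are hypothetical, and the es-hadoop jar is assumed to be on the Spark classpath.

```python
def es_read_conf(nodes, resource, query=None):
    """Build the es-hadoop settings for reading one index/type.

    Hypothetical helper; the values passed in are placeholders.
    """
    conf = {
        "es.nodes": nodes,        # ES host(s), e.g. "es-host:9200"
        "es.resource": resource,  # "index/type" to read from
    }
    if query is not None:
        conf["es.query"] = query  # optional query to limit what is pulled
    return conf


def load_es_rdd_on_disk(sc, conf):
    """Read from ES through es-hadoop and mark the RDD for on-disk caching."""
    # Imported lazily so es_read_conf() stays usable without Spark installed.
    from pyspark import StorageLevel

    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
    # persist() is lazy: the first action pulls the data from ES and
    # writes it to local disk; later actions reuse that local copy.
    return rdd.persist(StorageLevel.DISK_ONLY)
```

With an existing SparkContext `sc`, this would be called as `rdd = load_es_rdd_on_disk(sc, es_read_conf("es-host:9200", "logs/event"))`. Only the first action (for example `rdd.count()`) goes over the network to ES.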

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> While it's true locality might speed things up, I'd say it's a very bad
> idea to mix your Spark and ES clusters - if your ES cluster is serving
> production queries (and in particular using aggregations), you'll run into
> performance issues on your production ES cluster.
>
> ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so
> pulling it across the network is not too bad. If you do need to avoid that,
> pull the data and write what you need to HDFS as, say, parquet files (e.g.
> pull data daily and write it, then you have all data available on your
> Spark cluster).
>
> And of course ensure that when you do pull data from ES to Spark, you
> cache it to avoid hitting the network again.
>
> On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> If the data is local to the machine then obviously it will be faster
>> compared to pulling it through the network and storing it locally (either
>> memory or disk etc). Have a look at the data locality
>> <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
>> .
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Aug 18, 2015 at 8:09 PM, gen tang <gen.tan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Currently, I have my data in an Elasticsearch cluster and I am trying
>>> to use Spark to analyse that data.
>>> The Elasticsearch cluster and the Spark cluster are two different
>>> clusters, and I use the Hadoop input format (es-hadoop) to read data
>>> from ES.
>>>
>>> I am wondering how this environment affects the speed of analysis.
>>> If I understand correctly, Spark will read data from the ES cluster and
>>> do the computation on its own cluster (including writing shuffle results
>>> to its own machines). Is this right? If this is correct, I think that the
>>> performance will be just a little slower than when the data is stored on
>>> the same cluster.
>>>
>>> I would appreciate it if someone could share his/her experience of
>>> using Spark with Elasticsearch.
>>>
>>> Thanks a lot in advance for your help.
>>>
>>> Cheers
>>> Gen