While it's true that locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster.
ES-hadoop uses the ES scan & scroll API to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS, say as Parquet files (e.g. pull data daily and write it out; then you have all the data available on your Spark cluster). And of course, ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again.

— Sent from Mailbox

On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das <[email protected]> wrote:

> If the data is local to the machine then obviously it will be faster
> compared to pulling it through the network and storing it locally (either
> in memory or on disk, etc.). Have a look at data locality
> <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>.
>
> Thanks
> Best Regards
>
> On Tue, Aug 18, 2015 at 8:09 PM, gen tang <[email protected]> wrote:
>
>> Hi,
>>
>> Currently, I have my data in an Elasticsearch cluster and I am trying to
>> use Spark to analyse that data.
>> The Elasticsearch cluster and the Spark cluster are two different
>> clusters, and I use the Hadoop input format (es-hadoop) to read data
>> from ES.
>>
>> I am wondering how this environment affects the speed of analysis.
>> If I understand correctly, Spark will read data from the ES cluster and
>> do its computation on its own cluster (including writing shuffle results
>> to its own machines). Is this right? If so, I think the performance will
>> be just a little bit slower than if the data were stored on the same
>> cluster.
>>
>> I would appreciate it if someone could share his/her experience using
>> Spark with Elasticsearch.
>>
>> Thanks a lot in advance for your help.
>>
>> Cheers
>> Gen
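For what it's worth, the pull-cache-snapshot pattern described above can be sketched in PySpark (Spark 1.x era API). The ES host, index/type, and HDFS path here are hypothetical placeholders, not anything from the thread:

```python
# Sketch: read from ES via es-hadoop's Hadoop InputFormat, cache the RDD,
# and persist a daily snapshot to HDFS as Parquet.
# Assumptions: "es-cluster-host", "myindex/mytype", and the HDFS path
# are placeholders you'd replace with your own values.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="es-daily-pull")

# es-hadoop exposes ES as a Hadoop InputFormat; records arrive as
# (doc id, document fields) pairs, pulled via scan & scroll.
es_conf = {
    "es.nodes": "es-cluster-host",
    "es.resource": "myindex/mytype",
    "es.query": "?q=*",
}
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

# Cache so repeated actions don't go back over the network to ES.
rdd.cache()

# Write a daily snapshot to HDFS as Parquet; after this, jobs on the
# Spark cluster can read the local Parquet copy instead of hitting ES.
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd.map(lambda kv: dict(kv[1])))
df.write.parquet("hdfs:///data/es_snapshots/2015-08-25")
```

The RDD read itself is the standard es-hadoop example shape; the Parquet step is just one way to materialise the daily copy on the Spark side.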
