Re: Spark works with the data in another cluster(Elasticsearch)
While it's true that locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters. If your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster.

ES-hadoop uses the ES scan/scroll API to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS as, say, Parquet files (e.g. pull data daily and write it, then you have all the data available on your Spark cluster).

And of course ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again.

On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

If the data is local to the machine, then it will obviously be faster than pulling it over the network and storing it locally (in memory, on disk, etc.). Have a look at the data locality notes: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Thanks
Best Regards

On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:

Hi,

Currently, my data is in an Elasticsearch cluster and I am trying to use Spark to analyse it. The Elasticsearch cluster and the Spark cluster are two different clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.

I am wondering how this setup affects the speed of analysis. If I understand correctly, Spark will read the data from the ES cluster and do the computation on its own cluster (including writing shuffle results on its own machines). Is this right? If so, I think the performance will be just a little slower than with the data stored on the same cluster.

I would appreciate it if someone could share their experience of using Spark with Elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen
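
For anyone finding this thread later, the "pull daily and write Parquet to HDFS" suggestion above might look roughly like the PySpark sketch below. It is only an illustration under assumptions: it uses the es-hadoop Spark SQL data source (format "org.elasticsearch.spark.sql") rather than the Hadoop input format mentioned in the question, it assumes the es-hadoop jar is on the Spark classpath, and the host name, index name and HDFS path are hypothetical placeholders, not values from this thread.

```python
from datetime import date


def es_options(nodes, port="9200"):
    """Connection settings for es-hadoop ("es.nodes" and "es.port" are es-hadoop keys)."""
    return {"es.nodes": nodes, "es.port": port}


def daily_snapshot_path(base_dir, day):
    """Directory for one day's snapshot: <base_dir>/dt=YYYY-MM-DD."""
    return "{}/dt={}".format(base_dir.rstrip("/"), day.isoformat())


def snapshot_to_parquet(spark, resource, nodes, base_dir, day):
    """Read one ES index through es-hadoop and write it to HDFS as Parquet.

    `spark` is an existing SparkSession; `resource` is the ES index to read.
    All argument values are supplied by the caller (hypothetical in this sketch).
    """
    df = (spark.read
          .format("org.elasticsearch.spark.sql")  # es-hadoop Spark SQL source
          .options(**es_options(nodes))
          .load(resource))
    out = daily_snapshot_path(base_dir, day)
    # Overwrite so that re-running the same day's job replaces the snapshot.
    df.write.mode("overwrite").parquet(out)
    return out
```

A daily job could then call something like snapshot_to_parquet(spark, "logs", "es-host.example", "hdfs:///data/es", date.today()), and the analysis jobs would read the Parquet files instead of touching the production ES cluster.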
Re: Spark works with the data in another cluster(Elasticsearch)
If the data is local to the machine, then it will obviously be faster than pulling it over the network and storing it locally (in memory, on disk, etc.). Have a look at the data locality notes: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Thanks
Best Regards
Re: Spark works with the data in another cluster(Elasticsearch)
Great advice. Thanks a lot, Nick.

In fact, if we call rdd.persist(DISK) at the beginning of the program to avoid hitting the network again and again, the speed is not affected much. In my case, the job takes just 1 minute longer than when the data is stored in local HDFS.

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

While it's true that locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters. If your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses the ES scan/scroll API to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS as, say, Parquet files (e.g. pull data daily and write it, then you have all the data available on your Spark cluster). And of course ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again.
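
As a rough illustration of the persist-to-disk approach described in this reply, a PySpark version could look like the sketch below. It assumes the es-hadoop jar is on the classpath and reads through the Hadoop input format as in the original question. The storage level StorageLevel.DISK_ONLY is my assumption for what "DISK" refers to, and the host and index names passed in are hypothetical.

```python
def es_input_conf(nodes, resource, query=None):
    """Build the es-hadoop job configuration (keys are es-hadoop settings)."""
    conf = {"es.nodes": nodes, "es.resource": resource}
    if query is not None:
        conf["es.query"] = query  # optional: restrict what is pulled from ES
    return conf


def load_and_persist(sc, conf):
    """Read documents from ES once and keep them on the Spark cluster's disks.

    `sc` is an existing SparkContext. persist() is lazy: the first action on
    the returned RDD pulls the data from ES, and later actions reuse the
    persisted copy instead of going over the network again.
    """
    # Imported here so es_input_conf can be used without Spark installed.
    from pyspark import StorageLevel

    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
    # Each record is a (document id, document) pair; keep only the document.
    return rdd.map(lambda kv: kv[1]).persist(StorageLevel.DISK_ONLY)
```

For example, docs = load_and_persist(sc, es_input_conf("es-host.example", "logs")) followed by docs.count() would trigger the single read from ES, and every later job on docs would read from local disk.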