If the data is local to the machine running the task, processing it will be faster than pulling it over the network and then storing it locally (in memory or on disk). Have a look at the notes on data locality: <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
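To get a feel for how much the network hop matters, here is a rough back-of-the-envelope sketch in Python. The function name `scan_seconds` and all the numbers (data size, throughput, number of readers) are hypothetical placeholders, not measurements from any real Spark or Elasticsearch cluster; plug in figures from your own setup.

```python
def scan_seconds(data_gb, throughput_mb_per_s, parallel_readers=1):
    """Rough time (in seconds) to read data_gb gigabytes.

    Assumes every reader sustains throughput_mb_per_s and that the
    work is split evenly across parallel_readers (an idealization).
    """
    if throughput_mb_per_s <= 0 or parallel_readers <= 0:
        raise ValueError("throughput and parallel_readers must be positive")
    total_mb = data_gb * 1024
    return total_mb / (throughput_mb_per_s * parallel_readers)


# Hypothetical figures: 100 GB of data, 10 tasks reading in parallel.
# Local read at an assumed 200 MB/s per task versus a remote read
# limited to an assumed 100 MB/s per task by the network.
local_read = scan_seconds(100, 200, parallel_readers=10)
remote_read = scan_seconds(100, 100, parallel_readers=10)

print("local read : %.1f s" % local_read)
print("remote read: %.1f s" % remote_read)
print("slowdown   : %.1fx" % (remote_read / local_read))
```

This only models the initial read. Shuffle and computation happen on the Spark cluster either way, so the overall difference for a job depends on how much of its time goes into reading the input.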
Thanks
Best Regards

On Tue, Aug 18, 2015 at 8:09 PM, gen tang <gen.tan...@gmail.com> wrote:

> Hi,
>
> Currently, I have my data in an Elasticsearch cluster and I am trying to
> use Spark to analyse that data.
> The Elasticsearch cluster and the Spark cluster are two different
> clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.
>
> I am wondering how this environment affects the speed of analysis.
> If I understand correctly, Spark will read data from the ES cluster and do
> the calculation on its own cluster (including writing shuffle results on its
> own machines). Is this right? If this is correct, I think the performance
> will be just a little bit slower than with the data stored on the same cluster.
>
> I would appreciate it if someone could share his/her experience of using
> Spark with Elasticsearch.
>
> Thanks a lot in advance for your help.
>
> Cheers
> Gen
>