While it's true locality might speed things up, I'd say it's a very bad idea to 
co-locate your Spark and ES clusters on the same machines. If your ES cluster 
is serving production queries (in particular ones using aggregations), Spark 
jobs competing for the same resources will cause performance issues on your 
production ES cluster.




ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so pulling it 
across the network is not too bad. If you do need to avoid that, pull the data 
and write what you need to HDFS as, say, Parquet files (e.g. pull data daily 
and write it; then you have all the data available on your Spark cluster).
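
A rough sketch of what such a daily export could look like with PySpark and 
the es-hadoop Spark SQL data source. The field name "timestamp", the host, the 
index name and the HDFS path are all assumptions for illustration, and 
`export_day` expects an existing SparkSession with the es-hadoop jar on its 
classpath:

```python
import json
from datetime import date, timedelta


def es_read_options(es_nodes, day):
    """Build es-hadoop options that restrict the read to one calendar day."""
    next_day = day + timedelta(days=1)
    # Range query on a hypothetical "timestamp" field; adjust to your mapping.
    query = {
        "query": {
            "range": {
                "timestamp": {"gte": day.isoformat(), "lt": next_day.isoformat()}
            }
        }
    }
    return {"es.nodes": es_nodes, "es.query": json.dumps(query)}


def daily_output_path(base_path, day):
    """One directory per day, e.g. <base>/dt=2015-08-25."""
    return "%s/dt=%s" % (base_path.rstrip("/"), day.isoformat())


def export_day(spark, es_nodes, index, base_path, day):
    """Pull one day of documents from ES and write them to HDFS as Parquet."""
    reader = spark.read.format("org.elasticsearch.spark.sql")
    for key, value in es_read_options(es_nodes, day).items():
        reader = reader.option(key, value)
    # `index` is the ES resource to read ("index" or "index/type",
    # depending on the ES version).
    df = reader.load(index)
    df.write.mode("overwrite").parquet(daily_output_path(base_path, day))
```

Later jobs can then read the Parquet directories straight from HDFS (for 
example `spark.read.parquet("hdfs:///data/es")`) without touching ES at all.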




And of course ensure that when you do pull data from ES to Spark, you cache it 
to avoid hitting the network again.
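
A minimal sketch of the caching step, assuming `reader` is a Spark 
DataFrameReader already configured for es-hadoop (the helper name 
`load_and_cache` and the resource name in the usage comments are made up for 
illustration):

```python
def load_and_cache(reader, resource):
    """Load `resource` through `reader` and mark the result as cached."""
    df = reader.load(resource)
    # cache() is lazy: the data is pulled from ES by the first action,
    # and later actions reuse the copy held on the Spark executors.
    return df.cache()


# Hypothetical usage, given a SparkSession `spark`:
#   reader = (spark.read.format("org.elasticsearch.spark.sql")
#             .option("es.nodes", "es-host"))
#   events = load_and_cache(reader, "logs")
#   events.count()                      # first action: reads from ES
#   events.groupBy("status").count()    # reuses the cached data
```

Note that if cached partitions get evicted, Spark will recompute them, which 
means going back to ES for those partitions.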




On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das <[email protected]>
wrote:

> If the data is local to the machine then obviously it will be faster
> compared to pulling it through the network and storing it locally (either
> memory or disk, etc.). Have a look at the data locality guide:
> <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
> Thanks
> Best Regards
> On Tue, Aug 18, 2015 at 8:09 PM, gen tang <[email protected]> wrote:
>> Hi,
>>
>> Currently, I have my data in an Elasticsearch cluster and I am trying to
>> use Spark to analyse it.
>> The Elasticsearch cluster and the Spark cluster are two different
>> clusters, and I use the Hadoop input format (es-hadoop) to read data
>> from ES.
>>
>> I am wondering how this setup affects the speed of analysis.
>> If I understand correctly, Spark will read data from the ES cluster and
>> do the computation on its own cluster (including writing shuffle results
>> on its own machines). Is this right? If so, I think the performance will
>> be just a little bit slower than with the data stored on the same cluster.
>>
>> I would appreciate it if someone could share their experience of using
>> Spark with Elasticsearch.
>>
>> Thanks a lot in advance for your help.
>>
>> Cheers
>> Gen
>>
