Great advice.
Thanks a lot, Nick.

In fact, if we call rdd.persist(DISK) at the beginning of the program to
avoid hitting the network again and again, the speed is not affected much.
In my case, the job takes just 1 min longer than when the data is stored
in local HDFS.
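
For anyone finding this thread later, a minimal sketch of that setup in
PySpark follows. It assumes the elasticsearch-hadoop jar is on the Spark
classpath; the helper names es_read_conf and load_and_persist are made up
for illustration, and persist(DISK) above presumably corresponds to
StorageLevel.DISK_ONLY.

```python
# Sketch only: assumes PySpark and the elasticsearch-hadoop jar are available.
# Host and index names passed in by the caller are placeholders.


def es_read_conf(nodes, resource, query=None):
    """Build the es-hadoop settings for reading one index (hypothetical helper)."""
    conf = {
        "es.nodes": nodes,        # comma-separated ES hosts
        "es.resource": resource,  # index (or index/type) to read
    }
    if query is not None:
        conf["es.query"] = query  # optional query evaluated on the ES side
    return conf


def load_and_persist(sc, conf):
    """Read from ES via the Hadoop input format and persist to local disk."""
    # Imported here so es_read_conf stays usable without Spark installed.
    from pyspark import StorageLevel

    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
    # Assumption: DISK_ONLY is the storage level meant by persist(DISK).
    rdd.persist(StorageLevel.DISK_ONLY)
    return rdd
```

With an existing SparkContext sc, something like
load_and_persist(sc, es_read_conf("es-host:9200", "my-index/my-type"))
returns the RDD. The first action on it pulls the data from ES over the
network; later actions should read the persisted partitions from the
Spark workers' disks instead.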

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> While it's true locality might speed things up, I'd say it's a very bad
> idea to mix your Spark and ES clusters - if your ES cluster is serving
> production queries (and in particular using aggregations), you'll run into
> performance issues on your production ES cluster.
>
> ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so
> pulling it across the network is not too bad. If you do need to avoid that,
> pull the data and write what you need to HDFS as, say, Parquet files (e.g.
> pull data daily and write it, so that all the data is available on your
> Spark cluster).
>
> And of course ensure that when you do pull data from ES to Spark, you
> cache it to avoid hitting the network again.
>
>
>
> On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> If the data is local to the machine, then obviously it will be faster
>> than pulling it over the network and storing it locally (in memory or
>> on disk). Have a look at data locality:
>> <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Aug 18, 2015 at 8:09 PM, gen tang <gen.tan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Currently, my data is in an Elasticsearch cluster and I am trying to
>>> use Spark to analyse it.
>>> The Elasticsearch cluster and the Spark cluster are two separate
>>> clusters, and I use the Hadoop input format (es-hadoop) to read data
>>> from ES.
>>>
>>> I am wondering how this setup affects the speed of the analysis.
>>> If I understand correctly, Spark will read the data from the ES cluster
>>> and do the computation on its own cluster (including writing shuffle
>>> results on its own machines). Is this right? If so, I think the
>>> performance will be just a little slower than if the data were stored
>>> on the same cluster.
>>>
>>> I would appreciate it if someone could share their experience of
>>> using Spark with Elasticsearch.
>>>
>>> Thanks a lot in advance for your help.
>>>
>>> Cheers
>>> Gen
>>>
>>
>>
>
