Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Nick Pentreath
While it's true locality might speed things up, I'd say it's a very bad idea to
mix your Spark and ES clusters - if your ES cluster is serving production
queries (and in particular using aggregations), you'll run into performance
issues on your production ES cluster.

ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so pulling it
across the network is not too bad. If you do need to avoid that, pull the data
and write what you need to HDFS as, say, Parquet files (e.g. pull data daily and
write it, then you have all the data available on your Spark cluster).

And of course ensure that when you do pull data from ES to Spark, you cache it
to avoid hitting the network again.
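
A rough sketch of the "pull daily and write Parquet" idea, in Python. This is only an illustration under assumptions, not a tested job: the date field name "timestamp", the host name, the index name and the HDFS path are all hypothetical, and it assumes the es-hadoop Spark SQL data source ("org.elasticsearch.spark.sql") is on the classpath.

```python
import json
from datetime import date, timedelta


def es_read_options(nodes, day, date_field="timestamp"):
    """Build es-hadoop options that restrict the read to one calendar day.

    `date_field` is a hypothetical name for a date field in the documents.
    """
    next_day = day + timedelta(days=1)
    query = {
        "query": {
            "range": {
                date_field: {"gte": day.isoformat(), "lt": next_day.isoformat()}
            }
        }
    }
    return {
        "es.nodes": nodes,          # ES host(s) to connect to
        "es.port": "9200",          # default ES HTTP port
        "es.query": json.dumps(query),
    }


def daily_parquet_path(base, day):
    """One directory per day, e.g. <base>/dt=2015-08-25."""
    return "%s/dt=%s" % (base.rstrip("/"), day.isoformat())


def export_day(spark, nodes, resource, base, day):
    """Pull one day of documents from ES and write them to HDFS as Parquet.

    `spark` is a SparkSession (or SQLContext on older Spark versions).
    """
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**es_read_options(nodes, day))
        .load(resource)
    )
    df.write.mode("overwrite").parquet(daily_parquet_path(base, day))
```

Something like export_day(spark, "es-host", "my_index/my_type", "hdfs:///data/es", date(2015, 8, 25)) would then be run once per day, and later analysis reads the Parquet directories instead of going back to ES.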






Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Akhil Das
If the data is local to the machine then obviously it will be faster
compared to pulling it through the network and storing it locally (either
in memory or on disk). Have a look at data locality:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Thanks
Best Regards

On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 Currently, my data is in an Elasticsearch cluster and I am trying to use
 Spark to analyse it.
 The Elasticsearch cluster and the Spark cluster are two different
 clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.

 I am wondering how this setup affects the speed of analysis.
 If I understand correctly, Spark will read data from the ES cluster and do
 the computation on its own cluster (including writing shuffle results on its
 own machines). Is this right? If so, I think the performance will be just a
 little bit slower than with the data stored on the same cluster.

 I would appreciate it if someone could share their experience of using
 Spark with Elasticsearch.

 Thanks a lot in advance for your help.

 Cheers
 Gen



Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Great advice.
Thanks a lot Nick.

In fact, if we use the rdd.persist(DISK) command at the beginning of the
program to avoid hitting the network again and again, the speed is not
affected much. In my case, it takes just 1 min longer than when the data
is stored in local HDFS.
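
For anyone reading later, a minimal sketch of that persist step in Python. The helper name load_and_persist is made up for illustration; in PySpark the disk-only level is StorageLevel.DISK_ONLY. The helper takes the read function and the storage level as arguments so the pattern (persist, then run one action to materialise) is visible on its own.

```python
def load_and_persist(read_fn, storage_level):
    """Read an RDD once, mark it persisted, and force it to materialise.

    read_fn: zero-argument function returning the RDD (e.g. the es-hadoop read).
    storage_level: e.g. pyspark.StorageLevel.DISK_ONLY.
    """
    rdd = read_fn()
    rdd.persist(storage_level)  # lazy: only marks the RDD for persistence
    rdd.count()                 # action: pulls the data from ES and stores it locally
    return rdd


# Intended use with PySpark and es-hadoop on the classpath (not executed here;
# host and index names are hypothetical):
#
#   from pyspark import StorageLevel
#   conf = {"es.nodes": "es-host", "es.resource": "my_index/my_type"}
#   es_rdd = load_and_persist(
#       lambda: sc.newAPIHadoopRDD(
#           inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#           keyClass="org.apache.hadoop.io.NullWritable",
#           valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#           conf=conf),
#       StorageLevel.DISK_ONLY)
```

Later actions on es_rdd then read from the executors' local disks rather than scrolling through ES again.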

Cheers
Gen
