Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Nick Pentreath

While it's true locality might speed things up, I'd say it's a very bad idea to
mix your Spark and ES clusters - if your ES cluster is serving production
queries (and in particular using aggregations), you'll run into performance
issues on your production ES cluster.

ES-hadoop uses ES scan scroll to pull data pretty efficiently, so pulling it
across the network is not too bad. If you do need to avoid that, pull the data
and write what you need to HDFS as say parquet files (eg pull data daily and
write it, then you have all data available on your Spark cluster).
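
As a rough illustration of the "pull daily and write Parquet" idea above, here is a minimal PySpark sketch. It is not taken from this thread: it assumes a Spark version with SparkSession, the elasticsearch-hadoop jar on the classpath, and it uses the connector's Spark SQL data source rather than the Hadoop input format. The helper names, host names, index name and HDFS path are all hypothetical.

```python
from datetime import date


def es_read_options(nodes, port=9200, query=None):
    """Build connector options; the keys are es-hadoop setting names."""
    opts = {"es.nodes": nodes, "es.port": str(port)}
    if query is not None:
        opts["es.query"] = query
    return opts


def daily_snapshot_path(base_dir, day):
    """Return a per-day output directory such as <base_dir>/dt=2015-08-25."""
    return "{}/dt={}".format(base_dir.rstrip("/"), day.isoformat())


def snapshot_index_to_parquet(spark, resource, options, out_path):
    """Pull one ES index over the network and write it out as Parquet.

    Needs a live SparkSession with the elasticsearch-hadoop jar available.
    """
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**options)
        .load(resource)
    )
    df.write.mode("overwrite").parquet(out_path)
    return out_path


if __name__ == "__main__":
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-daily-snapshot").getOrCreate()
    opts = es_read_options("es-host-1,es-host-2")  # hypothetical hosts
    path = daily_snapshot_path("hdfs:///data/es/logs", date.today())  # hypothetical path
    snapshot_index_to_parquet(spark, "logs", opts, path)  # hypothetical index
```

Later jobs can then read the Parquet directories on the Spark side and never touch the ES cluster.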

And of course ensure that when you do pull data from ES to Spark, you cache it
to avoid hitting the network again.

On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

If the data is local to the machine then obviously it will be faster
compared to pulling it through the network and storing it locally (either
memory or disk etc). Have a look at the data locality
(http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html).

Thanks
Best Regards

On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:

Hi,

Currently, I have my data in an Elasticsearch cluster and I am trying to use
Spark to analyse it.
The Elasticsearch cluster and the Spark cluster are two different
clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.

I am wondering how this setup affects the speed of analysis.
If I understand correctly, Spark will read the data from the ES cluster and do
the computation on its own cluster (including writing shuffle results on its own
machines). Is this right? If so, I think the performance
will be just a little bit slower than with the data stored on the same cluster.

I would appreciate it if someone could share his/her experience of using
Spark with Elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Akhil Das

If the data is local to the machine then obviously it will be faster
compared to pulling it through the network and storing it locally (either
memory or disk etc). Have a look at the data locality
(http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html).

Thanks
Best Regards

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang

Great advice.
Thanks a lot Nick.

In fact, if we use the rdd.persist(DISK) command at the beginning of the
program to avoid hitting the network again and again, the speed is not
affected much. In my case, it takes just 1 min more than when the data is
in local HDFS.
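
For anyone reading later, the persist step could look roughly like the PySpark sketch below. This is an assumption-laden illustration, not the exact code from my job: it reads rdd.persist(DISK) as StorageLevel.DISK_ONLY, uses the es-hadoop input format classes with sc.newAPIHadoopRDD, and the helper names, host and index are hypothetical.

```python
# es-hadoop Hadoop input format and the Writable classes it is usually paired with.
ES_INPUT_FORMAT = "org.elasticsearch.hadoop.mr.EsInputFormat"
KEY_CLASS = "org.apache.hadoop.io.NullWritable"
VALUE_CLASS = "org.elasticsearch.hadoop.mr.LinkedMapWritable"


def es_input_conf(nodes, resource, query=None):
    """Build the Hadoop configuration dict passed to the ES input format."""
    conf = {"es.nodes": nodes, "es.resource": resource}
    if query is not None:
        conf["es.query"] = query
    return conf


def load_es_rdd_on_disk(sc, conf):
    """Read from ES once and keep the result on the Spark workers' local disks.

    Assumption: DISK_ONLY is what "persist(DISK)" refers to.
    """
    # Imported lazily so es_input_conf can be used without pyspark installed.
    from pyspark import StorageLevel

    rdd = sc.newAPIHadoopRDD(ES_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS, conf=conf)
    return rdd.persist(StorageLevel.DISK_ONLY)


if __name__ == "__main__":
    from pyspark import SparkContext

    sc = SparkContext(appName="es-persist-sketch")
    conf = es_input_conf("es-host-1", "logs/event")  # hypothetical host and index/type
    docs = load_es_rdd_on_disk(sc, conf)
    # The first action pulls the data over the network and fills the disk cache;
    # later actions on docs should read from local disk instead of ES.
    print(docs.count())
```

Note that persist is lazy, so the network pull only happens on the first action.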

Cheers
Gen
