While it's true locality might speed things up, I'd say it's a very bad idea to
mix your Spark and ES clusters - if your ES cluster is serving production
queries (and in particular using aggregations), you'll run into performance
issues on your production ES cluster.
ES-hadoop uses ES scan
If the data is local to the machine then it will obviously be faster than
pulling it over the network and storing it locally (either in memory or on
disk). Have a look at the data locality
Great advice.
Thanks a lot Nick.
In fact, if we call rdd.persist(DISK) at the beginning of the program to
avoid hitting the network again and again, the speed is not affected much.
In my case, the job takes just 1 min longer than when the data is stored in
local HDFS.
Cheers
Hi,
Currently, I have my data in an Elasticsearch cluster and I am trying to use
Spark to analyse it.
The Elasticsearch cluster and the Spark cluster are two separate clusters,
and I use the Hadoop input format (es-hadoop) to read data from ES.
I am wondering how this environment affect