I'm trying to get a Spark job running that pulls several million documents 
from an Elasticsearch cluster for some analytics that can't be done via 
aggregations. It was my understanding that es-hadoop maintains data 
locality when the Spark cluster runs alongside the Elasticsearch cluster, 
but I'm not finding that to be the case. My setup is 1 index, 20 ES nodes, 
20 shards. One task is created per Elasticsearch shard, but these tasks 
aren't distributed evenly across my Spark cluster (e.g. 3 Spark nodes end 
up getting all of the work, and the tasks a given Spark node gets don't 
correspond to the ES shards hosted on that same node). Could there be 
something wrong with my setup in this case?
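
In case it helps, here's a stripped-down sketch of the shape of the read 
(index name and node address are placeholders), plus the check I've been 
using to see where each partition prefers to run:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-locality-check")
      .set("es.nodes", "es-node-1:9200")  // placeholder: one of the co-located nodes

    val sc = new SparkContext(conf)

    // es-hadoop creates one RDD partition per ES shard (20 in my case)
    val rdd = sc.esRDD("my_index/doc")  // placeholder index/type

    // Print where the scheduler would prefer to run each partition; with
    // locality working, these should name the node that holds that shard.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}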

The job does end up running correctly, but it is very slow (approx. 5 
minutes for about 3 million documents) and there is clearly a lot of 
network I/O. Any tips would be appreciated.
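
The knobs I'm guessing at are the es-hadoop node discovery settings and 
Spark's locality wait; is something along these lines the right lever, or 
am I off track? (Values below are just what I'd try first.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("es-analytics")
  .set("es.nodes", "es-node-1:9200")      // placeholder seed node
  // let es-hadoop discover all data nodes so it can target the shard owners
  .set("es.nodes.discovery", "true")
  // wan.only routes every request through es.nodes and defeats locality,
  // so it should stay false for a co-located cluster
  .set("es.nodes.wan.only", "false")
  // give the Spark scheduler longer to wait for a node-local executor slot
  .set("spark.locality.wait", "10s")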
