I'm trying to get a Spark job running that pulls several million documents from an Elasticsearch cluster for some analytics that cannot be done via aggregations. It was my understanding that es-hadoop maintained data locality when the Spark cluster was running alongside the Elasticsearch cluster, but I am not finding that to be the case. My setup is 1 index, 20 ES nodes, 20 shards. One task is created per Elasticsearch shard, but these tasks aren't distributed evenly across my Spark cluster (e.g. 3 Spark nodes end up getting all of the work, and the tasks a Spark node gets don't correspond to the ES shards on that same node). Could there be something wrong with my setup here?
The job does end up running correctly, but it is very slow (approx. 5 minutes for about 3 million documents) and there is clearly a lot of network IO. Any tips would be appreciated.
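For reference, the read is essentially the stock es-hadoop RDD, along these lines (the index name, node address, and config values below are placeholders, not my exact setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

// Placeholder config: es.nodes seeds discovery of the colocated ES cluster.
val conf = new SparkConf()
  .setAppName("es-analytics")
  .set("es.nodes", "localhost:9200")
  // Give the Spark scheduler more time to wait for a node-local slot
  // before falling back to running the task anywhere.
  .set("spark.locality.wait", "30s")

val sc = new SparkContext(conf)

// es-hadoop creates one partition per shard and reports the shard's host
// as the partition's preferred location.
val docs = sc.esRDD("myindex/mytype", "?q=*")
println(docs.count())
```

I assumed the preferred locations reported per partition would be enough for the scheduler to place each task on the node holding that shard, but that doesn't seem to be happening.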