I'm trying to get a Spark job running that pulls several million documents
from an Elasticsearch cluster for some analytics that cannot be done via
aggregations. It was my understanding that es-hadoop maintains data
locality when the Spark cluster runs alongside the Elasticsearch nodes.
For the record, what Spark and es-hadoop versions are you using?
For each shard in your index, es-hadoop creates one Spark task, which is informed of the location of the underlying
shard.
So in your case, you would end up with 20 tasks/workers, one per shard,
streaming data back to the Spark side.
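To make the shard-to-task mapping concrete, here is a minimal Scala sketch. It assumes the elasticsearch-spark connector is on the classpath, an Elasticsearch cluster reachable at `localhost:9200`, and a hypothetical index named `docs` — adjust all of these for your setup. It requires a live cluster, so it is illustrative rather than a drop-in program.

```scala
// Sketch only: assumes a reachable Elasticsearch cluster at localhost:9200
// and a hypothetical index named "docs".
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD() to SparkContext

object EsShardTasks {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-shard-tasks")
      .set("es.nodes", "localhost:9200")
    val sc = new SparkContext(conf)

    // es-hadoop creates one RDD partition (and hence one Spark task)
    // per shard, and each task prefers the node holding its shard.
    val rdd = sc.esRDD("docs")

    // With a 20-shard index, this prints 20.
    println(rdd.partitions.length)

    sc.stop()
  }
}
```

Because each partition carries a preferred location pointing at the node that hosts its shard, co-locating Spark executors with the Elasticsearch nodes lets the scheduler read shard data locally instead of over the network.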