Hi Eric,
I see that you solved your problem. IMHO, when you call repartition you split
your work into two stages: the HBase lookup happens in the first stage, and the
upload to ES happens after the shuffle, in the next stage. Without the
repartition both steps run in the same stage, so it's hard to tell which part
of the time is the ES upload and which is the HBase lookup.
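To make the stage split concrete, here is a minimal PySpark-style sketch. It is only an illustration, not your actual job: `lookup_partition`, `upload_partition`, `build_pipeline` and the `lookup`/`send` callables are hypothetical names standing in for the HBase lookup and the ES upload.

```python
import json


def lookup_partition(rows, lookup=None):
    """First-stage work: enrich each key (stand-in for the HBase lookup)."""
    # `lookup` is a hypothetical callable; a real job would typically open
    # one HBase connection per partition here instead of one per row.
    lookup = lookup or (lambda key: {"key": key})
    for key in rows:
        yield json.dumps(lookup(key))


def upload_partition(docs, send=None):
    """Second-stage work: ship the JSON documents (stand-in for the ES upload)."""
    # `send` is a hypothetical callable that would post one batch to ES.
    send = send or (lambda batch: None)
    batch = list(docs)
    if batch:
        send(batch)
    yield len(batch)  # number of documents handled in this partition


def build_pipeline(rdd, num_partitions):
    """Wire the two steps together with a shuffle in between."""
    looked_up = rdd.mapPartitions(lookup_partition)   # runs before the shuffle
    shuffled = looked_up.repartition(num_partitions)  # stage boundary
    return shuffled.mapPartitions(upload_partition)   # runs after the shuffle
```

Assuming an existing SparkContext `sc`, something like `build_pipeline(sc.parallelize(keys), 8).sum()` would trigger the job. If you drop the `repartition` line, both `mapPartitions` calls get pipelined into a single stage, and the UI shows only one combined timing.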
I think I have found out what was causing me difficulties. It seems I was
reading too much into the stage description shown in the "Stages" tab of
the Spark application UI. While it said "repartition at
NativeMethodAccessorImpl.java:-2", I can infer from the network traffic and
from its response
Hi,
I am working on a pipeline that carries out a number of stages, the last of
which is to build some large JSON objects from information in the preceding
stages. The JSON objects are then uploaded to Elasticsearch in bulk.
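
For reference, a rough sketch of how a bulk request body can be put together, using the newline-delimited format of the Elasticsearch `_bulk` endpoint. The `bulk_body` helper and the index name are hypothetical, not my actual code, and older Elasticsearch releases also expect a `_type` in the action line.

```python
import json


def bulk_body(docs, index):
    """Build an NDJSON body for the _bulk endpoint from (doc_id, doc) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk format requires every line, including the last, to end in "\n".
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be sent as the body of a single POST to `_bulk`, one request per batch of documents.
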
If I carry out a shuffle via a `repartition` call after the JSON