Re: bulk upload to Elasticsearch and shuffle behavior

2015-09-01 Thread Igor Berman
Hi Eric, I see that you have solved your problem. IMHO, when you call repartition you split your work into two stages: the HBase lookup happens in the first stage, and the upload to ES happens after the shuffle, in the next stage. Without the repartition, both steps run in the same stage, so it's hard to tell where the ES upload is and where the HBase lookup is.
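
A rough PySpark-style sketch of the two-stage layout described here, in case it helps to see it. The names `lookup_hbase`, `upload_bulk` and `run_pipeline` are hypothetical placeholders, not code from this thread, and the function bodies are stand-ins for the real HBase and Elasticsearch calls. Only `map`, `repartition` and `foreachPartition` are actual RDD methods.

```python
def lookup_hbase(key):
    # Hypothetical stand-in for the HBase lookup; returns a JSON-ready dict.
    return {"id": key, "payload": "row-for-%s" % key}


def upload_bulk(docs):
    # Hypothetical stand-in for the Elasticsearch bulk upload.
    # It receives an iterator over the records of one partition.
    for _ in docs:
        pass


def run_pipeline(rdd, num_partitions):
    # Stage 1: map() is a narrow transformation, so the lookup runs
    # before any shuffle takes place.
    docs = rdd.map(lookup_hbase)
    # repartition() forces a shuffle, which closes stage 1 and opens stage 2.
    shuffled = docs.repartition(num_partitions)
    # Stage 2: the upload runs once per post-shuffle partition.
    shuffled.foreachPartition(upload_bulk)
```

With a real SparkContext `sc`, this would be called as something like `run_pipeline(sc.parallelize(keys), 16)`. If the `repartition` line is dropped, the lookup and the upload would be expected to show up in the UI as a single stage.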

Re: bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
I think I have found out what was causing me difficulties. It seems I was reading too much into the stage description shown in the "Stages" tab of the Spark application UI. While it said "repartition at NativeMethodAccessorImpl.java:-2", I can infer from the network traffic and from its response

bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
Hi, I am working on a pipeline that carries out a number of stages, the last of which is to build some large JSON objects from information in the preceding stages. The JSON objects are then uploaded to Elasticsearch in bulk. If I carry out a shuffle via a `repartition` call after the JSON
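
For the bulk upload step mentioned in this message, here is a minimal sketch of how a batch of JSON objects could be turned into a request body for the Elasticsearch bulk API, which expects newline-delimited JSON: one action line followed by one document line per object. The helper name `build_bulk_body` and the index name used in the usage note are assumptions for illustration, and the exact metadata fields accepted on the action line vary between Elasticsearch versions.

```python
import json


def build_bulk_body(docs, index_name):
    # docs: iterable of (doc_id, document_dict) pairs.
    # Each document becomes two lines: an "index" action line, then the source.
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"
```

Something like `build_bulk_body([("1", {"name": "a"})], "my-index")` would then be sent as the body of a POST to the `_bulk` endpoint, one request per partition or per fixed-size batch.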