Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-09 Thread Umesh Kacha

Hi Richard, thanks for the response. My use case is weird: I need to process the data row by row within one partition and update the required rows. About 30% of the rows would be updated. Following the suggestions in the stackoverflow.com answer above, I refactored the code to use mapPartitionsWithIndex: JavaRDD indexedRdd =

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-08 Thread Richard Marscher

Hi, what is the reasoning behind the use of `coalesce(1,false)`? It tells Spark to aggregate all the data into a single partition, which must fit in memory on one node in the Spark cluster. If the cluster has more than one node, the data must be shuffled to move it there. It doesn't seem like the following map

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-04 Thread unk1102

Hi, I have a Spark job that does some processing on ORC data and stores ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. The following piece of code uses a lot of shuffle memory. How do I optimize the code below? Is there anything wrong with it? It is working fine