Hi Richard, thanks for the response. My use case is unusual: I need to process
data row by row within a partition and update the rows that require it; roughly
30% of the rows end up updated. Following the suggestions in the
stackoverflow.com answer above, I refactored the code to use `mapPartitionsWithIndex`:
JavaRDD indexedRdd =
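For what it's worth, the per-partition logic for this kind of refactor can be sketched in plain Java, without a Spark cluster. The class name, the `needsUpdate` predicate, and the `toUpperCase` "update" below are all illustrative stand-ins, not your actual code; in Spark this body would be the `Function2<Integer, Iterator<String>, Iterator<String>>` passed to `rdd.mapPartitionsWithIndex(f, false)`:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartitionUpdate {
    // Per-partition logic: iterate the partition's rows once, updating the
    // rows that need it and passing the rest through unchanged. In Spark this
    // method would be wrapped in a Function2 and handed to
    // mapPartitionsWithIndex; the index parameter identifies the partition.
    static Iterator<String> processPartition(Integer index, Iterator<String> rows) {
        List<String> out = new ArrayList<>();
        while (rows.hasNext()) {
            String row = rows.next();
            if (needsUpdate(row)) {
                out.add(row.toUpperCase()); // placeholder for the real update
            } else {
                out.add(row);               // unchanged row passes through
            }
        }
        return out.iterator();
    }

    // Hypothetical predicate standing in for whatever marks a row as needing
    // an update (in the use case above, about 30% of rows).
    static boolean needsUpdate(String row) {
        return row.startsWith("x");
    }

    public static void main(String[] args) {
        Iterator<String> it =
                processPartition(0, List.of("abc", "xyz", "def").iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Building the output list in memory is fine when partitions are modest; for very large partitions a lazily evaluated iterator avoids materializing the whole partition at once.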
Hi,
what is the reasoning behind the use of `coalesce(1,false)`? This collapses
all the data into a single partition, which must fit in memory on one node in
the Spark cluster. If the cluster has more than one node, the data still has
to move over the network to that node, even though `shuffle=false` avoids a
full shuffle. It doesn't seem like the following map
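For context, what coalescing to one partition implies can be sketched in plain Java (no Spark needed; the method name `coalesceToOne` is illustrative): every parent partition is concatenated into one partition, so a single task ends up holding all the rows.

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Illustrative model of coalesce(1, false): the existing partitions are
    // merged into one without a shuffle stage, meaning one task reads every
    // parent partition and must accommodate all of the data.
    static List<List<Integer>> coalesceToOne(List<List<Integer>> partitions) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            merged.addAll(partition); // all rows land in the single partition
        }
        List<List<Integer>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts =
                List.of(List.of(1, 2), List.of(3), List.of(4, 5));
        // Three partitions become one partition containing all five rows.
        System.out.println(coalesceToOne(parts));
    }
}
```

This is why `coalesce(1,...)` also serializes the downstream work onto a single task: whatever parallelism the job had before that point is lost for the following stages.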
Hi, I have a Spark job which does some processing on ORC data and stores the
result back as ORC using the DataFrameWriter save() API introduced in Spark
1.4.0. The following piece of code is using heavy shuffle memory. How do I
optimize the code below? Is there anything wrong with it? It is working fine