I see you reposted with more details:
"I have 2 TB of
skewed data to process and then convert rdd into dataframe and use it as
table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4
cores."
If I'm reading that correctly, you have 2 TB of data and about 1.2 TB of executor memory in the cluster (60 executors x 20 GB each).
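For concreteness, here is a minimal sketch of what that allocation looks like as Spark conf settings; the conf keys are standard, but the app name is just a placeholder and the same thing can be done with spark-submit flags instead.

    import org.apache.spark.SparkConf

    // Sketch of the allocation described in the quote: 60 executors x 20 GB
    // each is roughly 1.2 TB of executor memory cluster-wide.
    val conf = new SparkConf()
      .setAppName("skewed-groupby")           // placeholder name
      .set("spark.executor.instances", "60")  // 60 executors
      .set("spark.executor.memory", "20g")    // 20 GB per executor
      .set("spark.executor.cores", "4")       // 4 cores per executor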
Do you have any details about the cluster you are running this against? The memory per executor/node, number of executors, and so on? Even at a shuffle setting of 1000, that would be roughly 1 GB per partition, assuming the 1 TB of data figure includes JVM overheads. Maybe try another order of magnitude.
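As a minimal sketch, "another order of magnitude" would look like the line below, assuming the same hiveContext as in the original post; the value 10000 is illustrative, not a tested recommendation.

    // Raise the shuffle partition count so each shuffle partition carries
    // less data: ~1 TB over 1,000 partitions is ~1 GB each, over 10,000 it
    // is ~100 MB each.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10000")

    // Equivalent SQL form:
    // hiveContext.sql("SET spark.sql.shuffle.partitions=10000")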
Hi, I am using Spark SQL, specifically hiveContext.sql() with group-by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. Please correct me if I am wrong: these partitions will share the data shuffle load.
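For reference, here is a minimal Spark 1.x sketch of the workflow being described (an RDD converted to a DataFrame, registered as a table, then queried with a group by through hiveContext.sql()). The case class, input/output paths, table name, and query are placeholders, not details from this thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Placeholder schema; the real columns are not given in the thread.
    case class Record(key: String, value: Long)

    object GroupBySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("groupby-sketch"))
        val hiveContext = new HiveContext(sc)
        import hiveContext.implicits._

        // Increase shuffle partitions so each group-by partition stays small.
        hiveContext.setConf("spark.sql.shuffle.partitions", "1000")

        // Convert the RDD into a DataFrame and expose it to SQL as a table.
        val rdd = sc.textFile("hdfs:///path/to/input")   // placeholder path
          .map(_.split("\t"))
          .map(fields => Record(fields(0), fields(1).toLong))
        rdd.toDF().registerTempTable("records")

        // Group-by query of the kind the OOM occurs on.
        val result = hiveContext.sql(
          "SELECT key, SUM(value) AS total FROM records GROUP BY key")
        result.write.parquet("hdfs:///path/to/output")   // placeholder path
      }
    }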