Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
I see you reposted with more details: "I have 2 TB of skewed data to process and then convert rdd into dataframe and use it as table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4 cores." If I'm reading that correctly, you have 2TB of data and 1.2TB of memory in the cluster.
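A rough sketch of that arithmetic in Scala (the executor count and memory come from the quoted setup; usable memory will be lower still once JVM and Spark overheads are taken out):

    // Back-of-the-envelope: total executor memory vs. input data size
    val numExecutors  = 60
    val executorMemGb = 20
    val totalMemGb    = numExecutors * executorMemGb   // 1200 GB ~= 1.2 TB
    val inputDataGb   = 2 * 1024                       // ~2 TB of skewed input
    println(f"cluster memory / data = ${totalMemGb.toDouble / inputDataGb}%.2f")  // ~0.59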

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
Do you have any details about the cluster you are running this against? The memory per executor/node, number of executors, and such? Even at a shuffle setting of 1000, that would be roughly 1GB per partition, assuming the 1TB of data includes overheads in the JVM. Maybe try another order of magnitude.
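A minimal sketch of the per-partition arithmetic and of raising the setting by an order of magnitude, assuming the sqlContext that spark-shell provides in Spark 1.x (the 10000 figure is only an illustration, not a number from this thread):

    // ~1 TB of shuffled data spread over N post-shuffle partitions
    val shuffledGb = 1024
    for (parts <- Seq(200, 1000, 10000))
      println(f"$parts%6d partitions -> ~${shuffledGb.toDouble / parts}%.2f GB per partition")

    // Bump the shuffle partition count before running the query that shuffles
    sqlContext.setConf("spark.sql.shuffle.partitions", "10000")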

What should be the optimal value for spark.sql.shuffle.partition?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL (actually hiveContext.sql()), which runs group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. Please correct me if I am wrong: these partitions will share the shuffled data, so more partitions should mean less data per partition and less memory pressure?
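A minimal sketch of how the setting is typically applied before the group-by query, assuming Spark 1.x with Hive support; the table name my_table and the query itself are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-example"))
    val hiveContext = new HiveContext(sc)

    // Must be set before the query so its shuffle uses the new partition count
    hiveContext.setConf("spark.sql.shuffle.partitions", "1000")

    // Hypothetical group-by; each of the 1000 post-shuffle partitions holds a slice of the groups
    val grouped = hiveContext.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key")
    println(grouped.count())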