Do you have any details about the cluster you are running this against? Memory per executor/node, number of executors, and so on? Even at a shuffle setting of 1000, that works out to roughly 1GB per partition, assuming the 1TB figure already includes JVM overhead. Maybe try another order of magnitude more shuffle partitions and see where that gets you?
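For reference, a minimal sketch of raising the setting before running the group-by query against a Spark 1.4 HiveContext. The partition count of 2000, the app name, and the table/column names below are placeholders for illustration, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Sketch only: placeholder app name, partition count, and query.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-example"))
    val hiveContext = new HiveContext(sc)

    // Raise the number of partitions Spark SQL uses for shuffles
    // (default is 200); more partitions means less data per task.
    hiveContext.setConf("spark.sql.shuffle.partitions", "2000")

    // Example group-by query; replace with your actual table and columns.
    val result = hiveContext.sql(
      "SELECT some_key, COUNT(*) AS cnt FROM some_table GROUP BY some_key")
    result.show()

You can also pass the same setting on submit with --conf spark.sql.shuffle.partitions=2000 instead of setting it in code.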
On Tue, Sep 1, 2015 at 12:11 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> Hi, I am using Spark SQL (specifically hiveContext.sql() with group by
> queries) and I am running into OOM issues. I am thinking of increasing
> spark.sql.shuffle.partitions from the default of 200 to 1000, but it is
> not helping. Please correct me if I am wrong: these partitions share the
> shuffle load, so the more partitions, the less data each one holds.
> Please guide me, I am new to Spark. I am using Spark 1.4.0 and have
> around 1TB of uncompressed data to process with hiveContext.sql()
> group by queries.

--
Richard Marscher
Software Engineer
Localytics