I see you reposted with more details: "I have 2 TB of skewed data to process and then convert rdd into dataframe and use it as table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4 cores."
If I'm reading that correctly, you have 2TB of data and 1.2TB of memory in the cluster. I think that's a fundamental problem up front. If it's skewed then that will be even worse for doing aggregation. I think to start the data either needs to be broken down or the cluster upgraded unfortunately. On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher <rmarsc...@localytics.com> wrote: > Do you have any details about the cluster you are running this against? > The memory per executor/node, number of executors, and such? Even at a > shuffle setting of 1000 that would be roughly 1GB per partition assuming > the 1TB of data includes overheads in the JVM. Maybe try another order of > magnitude higher for number of shuffle partitions and see where that gets > you? > > On Tue, Sep 1, 2015 at 12:11 PM, unk1102 <umesh.ka...@gmail.com> wrote: > >> Hi I am using Spark SQL actually hiveContext.sql() which uses group by >> queries and I am running into OOM issues. So thinking of increasing value >> of >> spark.sql.shuffle.partition from 200 default to 1000 but it is not >> helping. >> Please correct me if I am wrong this partitions will share data shuffle >> load >> so more the partitions less data to hold. Please guide I am new to Spark. >> I >> am using Spark 1.4.0 and I have around 1TB of uncompressed data to process >> using hiveContext.sql() group by queries. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-optimal-value-for-spark-sql-shuffle-partition-tp24543.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > -- > *Richard Marscher* > Software Engineer > Localytics > Localytics.com <http://localytics.com/> | Our Blog > <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | > Facebook <http://facebook.com/localytics> | LinkedIn > <http://www.linkedin.com/company/1148792?trk=tyah> > -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>