Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am now able to join two 1-billion-row tables in 3 minutes. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24782.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
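A back-of-the-envelope sketch (plain Python, not Spark code) of why raising spark.sql.shuffle.partitions helps here: with more shuffle partitions, each reduce task handles a smaller slice of the billion rows at once, which lowers per-task heap pressure and hence GC pause length. In Spark 1.3 the setting can be applied via sqlContext.setConf("spark.sql.shuffle.partitions", "2000") or SET spark.sql.shuffle.partitions=2000 in SQL. The row counts below assume an even key distribution:

```python
# Rows each shuffle (reduce) task must process, assuming ~1 billion
# rows on one side of the join, evenly spread across partitions.
rows = 1_000_000_000
default_partitions = 200     # spark.sql.shuffle.partitions default
tuned_partitions = 2_000     # value that fixed the reporter's job

rows_per_task_default = rows // default_partitions  # 5,000,000 rows/task
rows_per_task_tuned = rows // tuned_partitions      # 500,000 rows/task
print(rows_per_task_default, rows_per_task_tuned)
```

A 10x drop in per-task volume is often the difference between a heap that churns through full GCs and one that stays comfortably inside the young generation.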

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean the data is not distributed uniformly across partitions? All of my columns are strings and almost all the same size, e.g. id1,field11,field12 / id2,field21,field22

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column types?
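For readers wondering what "skewed" means in practice: if a few join keys carry most of the rows, hash partitioning sends all of those rows to the same task, and that single task's heap blows up no matter how many partitions are configured. A rough illustration in plain Python (Python's salted hash() stands in for Spark's partitioner here, so this is an analogy, not Spark's exact hashing):

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many rows land in each hash partition."""
    return Counter(hash(k) % num_partitions for k in keys)

uniform = [f"id{i}" for i in range(10_000)]                   # all keys distinct
skewed = ["hot"] * 9_000 + [f"id{i}" for i in range(1_000)]   # one hot join key

biggest_uniform = max(partition_sizes(uniform, 8).values())   # ~10_000 / 8
biggest_skewed = max(partition_sizes(skewed, 8).values())     # >= 9_000: the hot
print(biggest_uniform, biggest_skewed)                        # key is indivisible
```

The skewed case cannot be fixed by adding partitions: all 9,000 "hot" rows still hash to one task, which is why skew and partition count are separate diagnoses.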

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using. Yong > Could it be that your data is skewed? Do

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am getting the same issue.

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread ayan guha
You can use a custom partitioner to redistribute the data using partitionBy. On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote: I'm currently trying to join two large tables (on the order of 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm
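For context, partitionBy in Spark 1.3 is an RDD-level operation (e.g. pairs.partitionBy(new HashPartitioner(2000)) in Scala, or rdd.partitionBy(2000, partition_func) in PySpark); a custom Partitioner is just a mapping from key to partition index. A toy illustration of that mapping in plain Python — a hypothetical range partitioner, not any Spark API:

```python
def range_partitioner(key, boundaries):
    """Toy range partitioner: return the index of the first range the key
    falls under. Analogous to what a custom Spark Partitioner's
    getPartition(key) would compute; `boundaries` would come from
    sampling the key distribution."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)  # last partition catches everything else

boundaries = [250, 500, 750]           # 4 partitions over keys 0..999
print(range_partitioner(100, boundaries))  # 0
print(range_partitioner(900, boundaries))  # 3
```

A range partitioner built from sampled boundaries is one way to even out partitions when plain hashing leaves them unbalanced.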

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Nick Travers
Could you be more specific about how this is done? The DataFrame class doesn't have that method. On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote: You can use custom partitioner to redistribution using partitionby On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
Regarding the large GC pauses: assuming you allocated all 100GB of memory per worker, you may consider running with less memory on your Worker nodes, or splitting the available memory on the Worker nodes among several worker instances. The JVM's garbage collection starts to become very
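In Spark standalone mode, splitting one machine's memory across several smaller worker JVMs can be sketched in conf/spark-env.sh — e.g. four 25GB heaps instead of one 100GB heap, since smaller heaps generally mean shorter, more predictable GC pauses. Values here are illustrative only:

```shell
# conf/spark-env.sh (Spark standalone) -- illustrative values, size to your hosts
export SPARK_WORKER_INSTANCES=4   # four worker JVMs per machine
export SPARK_WORKER_MEMORY=25g    # 25GB per worker instead of one 100GB heap
export SPARK_WORKER_CORES=4       # divide the machine's cores to match
```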

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Michael Armbrust
If your data is evenly distributed (i.e. no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (the default is 200). On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com wrote: In regards to the large GC pauses, assuming you allocated
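A cheap way to check the "no skewed join keys" precondition before touching partition counts is to count rows per key and eyeball the heaviest ones — in DataFrame terms, something like a groupBy on the join key with a count, ordered descending. The sketch below does the equivalent in plain Python on a toy sample:

```python
from collections import Counter

def top_keys(keys, n=3):
    """Quick skew check: the n heaviest join keys and their row counts."""
    return Counter(keys).most_common(n)

sample = ["a"] * 7 + ["b"] * 2 + ["c"]
print(top_keys(sample))  # [('a', 7), ('b', 2), ('c', 1)]
```

If the top key holds a large fraction of all rows, fix the skew first; otherwise raising spark.sql.shuffle.partitions is the cheaper knob.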