Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to
join 2 1 billion rows tables in 3 minutes.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24782.html
Sent from the
By skewed did you mean it's not distributed uniformly across partition?
All of my columns are string and almost of same size. i.e.
id1,field11,fields12
id2,field21,field22
--
View this message in context:
Could it be that your data is skewed? Do you have variable-length column
types?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html
Sent from the Apache Spark User List mailing list
Or at least tell us how many partitions you are using.
Yong
> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
>
> Could it be that your data is skewed? Do
Or at least tell us how many partitions you are using.
Yong
> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
>
> Could it be that your data is skewed? Do
Did you get any solution to this? I am getting same issue.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You can use custom partitioner to redistribution using partitionby
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.
I'm
Could you be more specific in how this is done?
A DataFrame class doesn't have that method.
On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:
You can use custom partitioner to redistribution using partitionby
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
In regards to the large GC pauses, assuming you allocated all 100GB of
memory per worker you may consider running with less memory on your Worker
nodes, or splitting up the available memory on the Worker nodes amongst
several worker instances. The JVM's garbage collection starts to become
very
If you data is evenly distributed (i.e. no skewed datapoints in your join
keys), it can also help to increase spark.sql.shuffle.partitions (default
is 200).
On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com
wrote:
In regards to the large GC pauses, assuming you allocated
10 matches
Mail list logo