Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

Nick Travers Sun, 03 May 2015 23:26:23 -0700

Could you be more specific about how this is done?

The DataFrame class doesn't have that method.


On Sun, May 3, 2015 at 11:07 PM, ayan guha <guha.a...@gmail.com> wrote:

> You can use a custom partitioner to redistribute the data using partitionBy.
> On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
>
>> I'm currently trying to join two large tables (on the order of 1B rows
>> each) using Spark SQL (1.3.0) and am running into long GC pauses which
>> bring the job to a halt.
>>
>> I'm reading in both tables using a HiveContext with the underlying files
>> stored as Parquet files. I'm using something along the lines of
>> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
>> to set up the join.
>>
>> When I execute this (with an action such as .count) I see the first few
>> stages complete, but the job eventually stalls. The GC counts keep
>> increasing for each executor.
>>
>> I'm running with 6 workers, each with 2 TB of disk and 100 GB of RAM.
>>
>> Has anyone else run into this issue? I'm thinking I might be running into
>> issues with the shuffling of the data, but I'm unsure how to get around
>> this. Is there a way to redistribute the rows based on the join key first,
>> and then do the join?
>>
>> Thanks in advance.
>>
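
For anyone following along, the idea being discussed (redistribute the rows of both tables by the join key, then join partition by partition) can be sketched in plain Python. This is only an illustration of the technique, not Spark code: the function names, the (key, value) row layout, and the partition count are all made up for the example. In Spark itself, partitionBy is a method on key-value RDDs rather than on DataFrame, so applying the same idea there would presumably mean going through the underlying RDDs.

```python
from collections import defaultdict


def partition_by_key(rows, num_partitions):
    """Bucket (key, value) rows by hashing the key (hypothetical helper)."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # Rows with the same key always land in the same bucket index.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left_rows, right_rows):
    """Inner-join two co-located partitions on their key."""
    right_by_key = defaultdict(list)
    for key, value in right_rows:
        right_by_key[key].append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in right_by_key.get(key, [])
    ]


def partitioned_join(left, right, num_partitions=4):
    """Partition both sides with the same function, then join bucket by bucket."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    result = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        # Matching keys can only be in the same bucket index, so each
        # bucket pair can be joined independently of the others.
        result.extend(join_partition(left_rows, right_rows))
    return result


# Toy stand-ins for tables a and b, as (col1, other column) pairs.
a = [(1, "a1"), (2, "a2"), (3, "a3")]
b = [(1, "b1"), (3, "b3"), (3, "b3x"), (4, "b4")]
joined = partitioned_join(a, b, num_partitions=3)
```

The point of using the same partitioning function on both sides is that each bucket pair only has to hold its own slice of the keys in memory at once, instead of one large hash table for the whole join.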
