If your data is evenly distributed (i.e., no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (the default is 200).
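A minimal sketch of how that setting can be raised, assuming a PySpark shell with an existing sqlContext (this is a config fragment; the value 400 is illustrative, not a recommendation):

```
# Increase the number of shuffle partitions before running the join,
# so each post-shuffle partition holds less data and GCs less heap.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# Equivalently, in Spark SQL itself:
# SET spark.sql.shuffle.partitions=400;
```

With 1B-row tables, 200 partitions can leave each task holding gigabytes of shuffle data, so raising this is often the first knob to try.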
On Mon, May 4, 2015 at 8:03 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> In regards to the large GC pauses, assuming you allocated all 100GB of
> memory per worker, you may consider running with less memory on your Worker
> nodes, or splitting up the available memory on the Worker nodes amongst
> several worker instances. The JVM's garbage collection starts to become
> very slow as the memory allocation for the heap becomes large. At 100GB it
> may not be unusual to see GC take minutes at a time. I believe with Yarn or
> Standalone clusters you should be able to run multiple smaller JVM
> instances on your workers, so you can still use your cluster resources but
> also won't have a single JVM allocating an unwieldy amount of memory.
>
> On Mon, May 4, 2015 at 2:23 AM, Nick Travers <n.e.trav...@gmail.com> wrote:
>
>> Could you be more specific in how this is done?
>>
>> The DataFrame class doesn't have that method.
>>
>> On Sun, May 3, 2015 at 11:07 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> You can use a custom partitioner to redistribute the data using partitionBy.
>>>
>>> On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
>>>
>>>> I'm currently trying to join two large tables (on the order of 1B rows
>>>> each) using Spark SQL (1.3.0) and am running into long GC pauses which
>>>> bring the job to a halt.
>>>>
>>>> I'm reading in both tables using a HiveContext, with the underlying
>>>> files stored as Parquet files. I'm using something along the lines of
>>>> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 =
>>>> b.col1") to set up the join.
>>>>
>>>> When I execute this (with an action such as .count) I see the first few
>>>> stages complete, but the job eventually stalls. The GC counts keep
>>>> increasing for each executor.
>>>>
>>>> I'm running with 6 workers, each with 2T of disk and 100GB of RAM.
>>>>
>>>> Has anyone else run into this issue?
>>>> I'm thinking I might be running into
>>>> issues with the shuffling of the data, but I'm unsure of how to get
>>>> around this. Is there a way to redistribute the rows based on the join
>>>> key first, and then do the join?
>>>>
>>>> Thanks in advance.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
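The "redistribute by join key first" idea suggested in the thread rests on hash partitioning: every row whose join key hashes to the same value lands in the same partition, so two co-partitioned tables can be joined partition-by-partition without a further shuffle. A minimal, Spark-free sketch of the mechanism (function names and the partition count are illustrative, not Spark's API):

```python
def partition_for(key, num_partitions=200):
    # Same idea as Spark's HashPartitioner: hash the join key and take a
    # non-negative modulus, so equal keys always map to the same partition.
    return hash(key) % num_partitions

def redistribute(rows, key_fn, num_partitions=200):
    # Bucket rows into partitions by join key. Two tables redistributed
    # with the same key function and partition count are co-partitioned:
    # matching keys are guaranteed to sit in same-numbered partitions.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(key_fn(row), num_partitions)].append(row)
    return partitions

a = [(1, "x"), (2, "y"), (1, "z")]
parts = redistribute(a, key_fn=lambda r: r[0])
# Both rows with join key 1 end up together in one partition.
```

In Spark 1.3 this is what `RDD.partitionBy(new HashPartitioner(n))` does under the hood; DataFrames in 1.3 expose no such method directly, which is why the thread suggests dropping to the RDD level.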