If your data is evenly distributed (i.e. no heavily skewed values in your join
keys), it can also help to increase spark.sql.shuffle.partitions (the default
is 200).
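
For example (a minimal sketch, assuming a HiveContext/SQLContext named
sqlContext; the value 2000 is purely illustrative and should be tuned to your
data volume and cluster size):

    // raise the number of post-shuffle partitions before running the join
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

    // or, equivalently, from SQL
    sqlContext.sql("SET spark.sql.shuffle.partitions=2000")

More partitions mean smaller shuffle blocks per task, which keeps each task's
memory footprint (and hence GC pressure) down.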

On Mon, May 4, 2015 at 8:03 AM, Richard Marscher <rmarsc...@localytics.com>
wrote:

> Regarding the long GC pauses: assuming you allocated all 100GB of memory
> per worker, you may want to consider running with less memory on your
> Worker nodes, or splitting the available memory on each Worker node amongst
> several worker instances. The JVM's garbage collection becomes very slow as
> the heap grows large; with a 100GB heap it would not be unusual to see GC
> pauses of minutes at a time. I believe that with both YARN and Standalone
> clusters you can run multiple smaller JVM instances per node, so you still
> use all of your cluster's resources, but no single JVM has to manage an
> unwieldy amount of memory.
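>
> For example, on a Standalone cluster something along these lines in
> conf/spark-env.sh on each worker node should do it (a sketch only; the
> figures are illustrative, not recommendations):
>
>     # run 4 smaller worker JVMs per node instead of one 100GB JVM
>     SPARK_WORKER_INSTANCES=4
>     SPARK_WORKER_MEMORY=25g
>     # cores per worker instance, split across the node's total
>     SPARK_WORKER_CORES=8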
>
> On Mon, May 4, 2015 at 2:23 AM, Nick Travers <n.e.trav...@gmail.com>
> wrote:
>
>> Could you be more specific about how this is done?
>>
>> The DataFrame class doesn't have that method.
>>
>> On Sun, May 3, 2015 at 11:07 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> You can use a custom partitioner to redistribute the data by the join key, using partitionBy.
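>>>
>>> A rough sketch of that idea (in 1.3 partitionBy lives on the underlying
>>> RDD, not on DataFrame; dfA/dfB and the key column index are placeholders):
>>>
>>>     import org.apache.spark.HashPartitioner
>>>
>>>     val part = new HashPartitioner(2000)
>>>     // key both sides by the join column and co-partition them
>>>     val aByKey = dfA.rdd.map(r => (r.getString(0), r)).partitionBy(part)
>>>     val bByKey = dfB.rdd.map(r => (r.getString(0), r)).partitionBy(part)
>>>     val joined = aByKey.join(bByKey)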
>>> On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
>>>
>>>> I'm currently trying to join two large tables (on the order of 1B rows
>>>> each) using Spark SQL (1.3.0) and am running into long GC pauses which
>>>> bring the job to a halt.
>>>>
>>>> I'm reading in both tables using a HiveContext, with the underlying files
>>>> stored as Parquet. I'm using something along the lines of
>>>> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
>>>> to set up the join.
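>>>>
>>>> In full, the setup is something like this (names simplified):
>>>>
>>>>     val hc = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>
>>>>     // a and b are Hive tables backed by Parquet files
>>>>     val joined = hc.sql(
>>>>       "SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
>>>>     joined.count()  // the action that eventually stalls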
>>>>
>>>> When I execute this (with an action such as .count) I see the first few
>>>> stages complete, but the job eventually stalls. The GC counts keep
>>>> increasing for each executor.
>>>>
>>>> I'm running with 6 workers, each with 2TB of disk and 100GB of RAM.
>>>>
>>>> Has anyone else run into this issue? I suspect I'm running into issues
>>>> with the shuffling of the data, but I'm unsure how to get around this.
>>>> Is there a way to redistribute the rows based on the join key first, and
>>>> then do the join?
>>>>
>>>> Thanks in advance.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>
>
