If your data is evenly distributed (i.e., no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (the default is 200).
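A minimal sketch of how that setting can be raised, assuming a PySpark shell with an existing sqlContext (this is a config fragment; the value 400 is illustrative, not a recommendation):

```
# Increase the number of shuffle partitions before running the join,
# so each post-shuffle partition holds less data and GCs less heap.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# Equivalently, in Spark SQL itself:
# SET spark.sql.shuffle.partitions=400;
```

With 1B-row tables, 200 partitions can leave each task holding gigabytes of shuffle data, so raising this is often the first knob to try.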
On Mon, May 4, 2015 at 8:03 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> In regards to the large GC pauses, assuming you allocated all 100GB of
> memory per worker, you may consider running with less memory on your Worker
> nodes, or splitting up the available memory on the Worker nodes amongst
> several worker instances. The JVM's garbage collection starts to become
> very slow as the memory allocation for the heap becomes large. At 100GB it
> may not be unusual to see GC take minutes at a time. I believe with Yarn or
> Standalone clusters you should be able to run multiple smaller JVM
> instances on your workers, so you can still use your cluster resources but
> also won't have a single JVM allocating an unwieldy amount of memory.
>
> On Mon, May 4, 2015 at 2:23 AM, Nick Travers <n.e.trav...@gmail.com> wrote:
>
>> Could you be more specific in how this is done?
>>
>> The DataFrame class doesn't have that method.
>>
>> On Sun, May 3, 2015 at 11:07 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> You can use a custom partitioner to redistribute the data using partitionBy.
>>>
>>> On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
>>>
>>>> I'm currently trying to join two large tables (on the order of 1B rows
>>>> each) using Spark SQL (1.3.0) and am running into long GC pauses which
>>>> bring the job to a halt.
>>>>
>>>> I'm reading in both tables using a HiveContext, with the underlying
>>>> files stored as Parquet files. I'm using something along the lines of
>>>> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 =
>>>> b.col1") to set up the join.
>>>>
>>>> When I execute this (with an action such as .count) I see the first few
>>>> stages complete, but the job eventually stalls. The GC counts keep
>>>> increasing for each executor.
>>>>
>>>> I'm running with 6 workers, each with 2T of disk and 100GB of RAM.
>>>>
>>>> Has anyone else run into this issue?
>>>> I'm thinking I might be running into
>>>> issues with the shuffling of the data, but I'm unsure of how to get
>>>> around this. Is there a way to redistribute the rows based on the join
>>>> key first, and then do the join?
>>>>
>>>> Thanks in advance.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
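The "redistribute by join key first" idea suggested in the thread rests on hash partitioning: every row whose join key hashes to the same value lands in the same partition, so two co-partitioned tables can be joined partition-by-partition without a further shuffle. A minimal, Spark-free sketch of the mechanism (function names and the partition count are illustrative, not Spark's API):

```python
def partition_for(key, num_partitions=200):
    # Same idea as Spark's HashPartitioner: hash the join key and take a
    # non-negative modulus, so equal keys always map to the same partition.
    return hash(key) % num_partitions

def redistribute(rows, key_fn, num_partitions=200):
    # Bucket rows into partitions by join key. Two tables redistributed
    # with the same key function and partition count are co-partitioned:
    # matching keys are guaranteed to sit in same-numbered partitions.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(key_fn(row), num_partitions)].append(row)
    return partitions

a = [(1, "x"), (2, "y"), (1, "z")]
parts = redistribute(a, key_fn=lambda r: r[0])
# Both rows with join key 1 end up together in one partition.
```

In Spark 1.3 this is what `RDD.partitionBy(new HashPartitioner(n))` does under the hood; DataFrames in 1.3 expose no such method directly, which is why the thread suggests dropping to the RDD level.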