You can use a custom partitioner to redistribute the rows by join key using partitionBy.

On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
> I'm currently trying to join two large tables (order 1B rows each) using
> Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
> a halt.
>
> I'm reading in both tables using a HiveContext with the underlying files
> stored as Parquet Files. I'm using something along the lines of
> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
> to set up the join.
>
> When I execute this (with an action such as .count) I see the first few
> stages complete, but the job eventually stalls. The GC counts keep
> increasing for each executor.
>
> Running with 6 workers, each with 2T disk and 100GB RAM.
>
> Has anyone else run into this issue? I'm thinking I might be running into
> issues with the shuffling of the data, but I'm unsure of how to get around
> this? Is there a way to redistribute the rows based on the join key first,
> and then do the join?
>
> Thanks in advance.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
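
To make the partitionBy suggestion a bit more concrete, here is a minimal sketch at the RDD level, not something I have run against tables of this size. The class name JoinKeyPartitioner, the partition count of 8, the local master, and the tiny sample data are all made up for illustration; with the real tables you would first turn each one into an RDD of (joinKey, row) pairs. Whether this actually relieves the GC pressure will depend on the data and the executor settings.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

// Hypothetical custom partitioner: sends each join key to a partition by its hash.
class JoinKeyPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, "partitions must be positive")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val h = if (key == null) 0 else key.hashCode % partitions
    if (h < 0) h + partitions else h // keep the index non-negative
  }

  // equals/hashCode let Spark see that two RDDs share the same partitioning.
  override def equals(other: Any): Boolean = other match {
    case p: JoinKeyPartitioner => p.numPartitions == numPartitions
    case _                     => false
  }

  override def hashCode: Int = numPartitions
}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for a real cluster config.
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val partitioner = new JoinKeyPartitioner(8) // illustrative partition count

    // Stand-ins for tables a and b, keyed by the join column col1.
    val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2", 3 -> "a3")).partitionBy(partitioner)
    val b = sc.parallelize(Seq(1 -> "b1", 3 -> "b3", 4 -> "b4")).partitionBy(partitioner)

    // Both sides already use the same partitioner, so matching keys sit in the
    // same partition and the join itself does not need another shuffle.
    val joined = a.join(b, partitioner)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```

Using more, smaller partitions than the default is usually the first thing to try here, since each task then holds fewer rows in memory at once.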