I'm currently trying to join two large tables (order 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt.
I'm reading in both tables using a HiveContext with the underlying files stored as Parquet Files. I'm using something along the lines of HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1") to set up the join. When I execute this (with an action such as .count) I see the first few stages complete, but the job eventually stalls. The GC counts keep increasing for each executor. Running with 6 workers, each with 2T disk and 100GB RAM. Has anyone else run into this issue? I'm thinking I might be running into issues with the shuffling of the data, but I'm unsure of how to get around this? Is there a way to redistribute the rows based on the join key first, and then do the join? Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org