I'm currently trying to join two large tables (on the order of 1B rows each)
using Spark SQL (1.3.0) and am running into long GC pauses that bring the
job to a halt.

I'm reading in both tables using a HiveContext, with the underlying files
stored as Parquet. I'm using something along the lines of
HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1") to
set up the join.
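
For reference, here's a minimal sketch of the setup (the app name and the
table/column names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Placeholder app name; tables a and b are registered in the Hive
    // metastore and backed by Parquet files.
    val sc = new SparkContext(new SparkConf().setAppName("billion-row-join"))
    val hiveContext = new HiveContext(sc)

    val joined = hiveContext.sql(
      "SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")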

When I execute this (with an action such as .count()), I see the first few
stages complete, but the job eventually stalls. The GC counts keep
increasing on every executor.
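
In case it helps with diagnosis, I've been watching the GC activity by
passing the standard JVM GC-logging flags through
spark.executor.extraJavaOptions (nothing Spark-specific beyond the conf key):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-verbose:gc \
        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...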

I'm running with 6 workers, each with 2TB of disk and 100GB of RAM.

Has anyone else run into this issue? I suspect the shuffle is the problem,
but I'm not sure how to get around it. Is there a way to redistribute the
rows based on the join key first, and then do the join? A sketch of what I
have in mind follows.
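
Something along these lines, perhaps (just a sketch; the partition count of
2000 is an arbitrary guess, and I'm not sure whether pre-partitioning at the
RDD level actually carries through to the SQL join):

    // Raise the shuffle parallelism so each task's hash table stays small
    // (the default is 200; 2000 is a guess for tables this size).
    hiveContext.setConf("spark.sql.shuffle.partitions", "2000")

    // Or drop to the RDD level and hash-partition by the join key
    // (col1 is the first column here):
    import org.apache.spark.HashPartitioner
    val aByKey = hiveContext.table("a").rdd
      .keyBy(row => row(0))
      .partitionBy(new HashPartitioner(2000))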

Thanks in advance.


