Could you be more specific about how this is done? The DataFrame class doesn't have a partitionBy method.
On Sun, May 3, 2015 at 11:07 PM, ayan guha <guha.a...@gmail.com> wrote:

> You can use a custom partitioner to redistribute the data using partitionBy.
>
> On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:
>
>> I'm currently trying to join two large tables (order 1B rows each) using
>> Spark SQL (1.3.0) and am running into long GC pauses which bring the job
>> to a halt.
>>
>> I'm reading in both tables using a HiveContext with the underlying files
>> stored as Parquet files. I'm using something along the lines of
>> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
>> to set up the join.
>>
>> When I execute this (with an action such as .count) I see the first few
>> stages complete, but the job eventually stalls. The GC counts keep
>> increasing for each executor.
>>
>> Running with 6 workers, each with 2T disk and 100GB RAM.
>>
>> Has anyone else run into this issue? I'm thinking I might be running into
>> issues with the shuffling of the data, but I'm unsure of how to get around
>> this. Is there a way to redistribute the rows based on the join key first,
>> and then do the join?
>>
>> Thanks in advance.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
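
For what it's worth, the partitionBy suggestion quoted above presumably refers to key-value RDDs rather than to DataFrame: as far as I know, in Spark 1.3 partitionBy is defined on pair RDDs, so one would have to go through the DataFrame's underlying RDD keyed by col1. The block below is only a plain-Python sketch of the idea itself (hash-partition both tables on the join key, then join matching partitions independently). It is not Spark code, the function names partition_by_key and co_partitioned_join are hypothetical, and the small in-memory lists stand in for tables a and b.

```python
from collections import defaultdict


def partition_by_key(rows, key_index, num_partitions):
    """Bucket rows so that equal join keys always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def co_partitioned_join(left, right, num_partitions=4):
    """Mimic SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1.

    Both inputs are lists of tuples whose first element is col1
    (an assumption made for this sketch).
    """
    left_parts = partition_by_key(left, 0, num_partitions)
    right_parts = partition_by_key(right, 0, num_partitions)

    result = []
    # Partition i of the left side can only match partition i of the right
    # side, so each pair can be joined without looking at any other data.
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a hash table over the right-hand rows of this partition only.
        lookup = defaultdict(list)
        for rrow in rpart:
            lookup[rrow[0]].append(rrow)
        for lrow in lpart:
            for rrow in lookup.get(lrow[0], []):
                result.append((lrow[0], rrow[1]))
    return result


if __name__ == "__main__":
    a = [(1, "x"), (2, "y"), (3, "z")]
    b = [(1, "p"), (3, "q"), (3, "r"), (4, "s")]
    print(sorted(co_partitioned_join(a, b)))
```

The point of the sketch is that each per-partition hash table only holds a fraction of one table, which is the property a partition-then-join approach relies on. Whether doing this by hand actually reduces GC pressure compared with the shuffle Spark SQL already performs for the join is an open question here, not something the sketch demonstrates.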