You can use a custom partitioner to redistribute the rows by the join key
with partitionBy before doing the join.
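
For example, something along these lines (an untested sketch against the
RDD API; the table, column, and variable names are placeholders for your
own) keys each side by the join column and co-partitions both sides so
that matching keys land in the same partition:

  import org.apache.spark.HashPartitioner

  // Key each table's rows by the join column (placeholder column index).
  val aRdd = hiveContext.sql("SELECT col1, col2 FROM a").rdd
    .map(row => (row.getString(0), row))
  val bRdd = hiveContext.sql("SELECT col1, col2 FROM b").rdd
    .map(row => (row.getString(0), row))

  // Use the same partitioner on both sides; a higher partition count
  // keeps individual partitions small, which helps avoid long GC pauses.
  val partitioner = new HashPartitioner(2000)

  val joined = aRdd.partitionBy(partitioner)
    .join(bRdd.partitionBy(partitioner))

Because both RDDs share the same partitioner, the join itself should not
trigger another full shuffle. If you stay with the SQL join instead, you
may also want to raise spark.sql.shuffle.partitions from its default.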
On 4 May 2015 15:37, "Nick Travers" <n.e.trav...@gmail.com> wrote:

> I'm currently trying to join two large tables (on the order of 1B rows
> each) using Spark SQL (1.3.0) and am running into long GC pauses which
> bring the job to a halt.
>
> I'm reading in both tables using a HiveContext, with the underlying files
> stored as Parquet. I'm using something along the lines of
> HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
> to set up the join.
>
> When I execute this (with an action such as .count) I see the first few
> stages complete, but the job eventually stalls. The GC counts keep
> increasing for each executor.
>
> I'm running with 6 workers, each with 2 TB of disk and 100 GB of RAM.
>
> Has anyone else run into this issue? I'm thinking I might be running into
> issues with the shuffling of the data, but I'm unsure how to get around
> it. Is there a way to redistribute the rows based on the join key first,
> and then do the join?
>
> Thanks in advance.
