Can you try repartitioning the RDD after creating the (K, V) pairs? Also,
when calling join, pass the number of partitions as the second argument,
e.g. rdd1.join(rdd2, numPartitions).
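
For what it's worth, here is a tiny pure-Python model (not Spark code) of what
the partition count changes under hash partitioning. The helper names
partition_for and partition_sizes are made up for illustration, and the 950/50
split is only a toy stand-in for your data. It suggests that a larger partition
count spreads distinct keys more thinly, while rows that share one key still
end up together, so on its own it may only partly help with your 95% key.

```python
from collections import Counter


def partition_for(key, num_partitions):
    """Mimic a hash partitioner: a given key always maps to the same partition."""
    # Python's % with a positive divisor is never negative, so this is a valid index.
    return hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition index."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


# Toy stand-in for RDD A (assumption): 95% of the rows share key 0,
# the remaining 5% have distinct keys 1..50. Integer keys keep hash() stable.
keys = [0] * 950 + list(range(1, 51))

for n in (4, 16):
    sizes = partition_sizes(keys, n)
    print(n, "partitions: largest holds", max(sizes), "of", len(keys), "rows")
```

Whatever partition count you pick, the partition that receives the hot key
cannot get smaller than the number of rows carrying that key.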

Thanks
Best Regards

On Wed, Jun 17, 2015 at 12:15 PM, Al M <alasdair.mcbr...@gmail.com> wrote:

> I have 2 RDDs I want to join.  We will call them RDD A and RDD B.  RDD A has
> 1 billion rows; RDD B has 100k rows.  I want to join them on a single key.
>
> 95% of the rows in RDD A have the same key to join with RDD B.  Before I can
> join the two RDDs, I must map them to tuples where the first element is the
> key and the second is the value.
>
> Since 95% of the rows in RDD A have the same key, they now go into the same
> partition.  When I perform the join, the system will try to execute this
> partition in just one task.  This one task will try to load too much data
> into memory at once and die a horrible death.
>
> I know that this is caused by the HashPartitioner that is used by default in
> Spark; everything with the same hashed key goes into the same partition.  I
> also tried the RangePartitioner but still saw 95% of the data go into the
> same partition.  What I'd really like is a partitioner that puts everything
> with the same key into the same partition *except* when the partition is
> over a certain size, then it would just spill into the next partition.
>
> Writing my own partitioner is a big job, and requires a lot of testing to
> make sure I get it right.  Is there a simpler way to solve this?
>
