Can you try repartitioning the RDD after creating the (K, V) pairs? Also, when calling `rdd1.join(rdd2, ...)`, pass the number-of-partitions argument too.
Thanks
Best Regards

On Wed, Jun 17, 2015 at 12:15 PM, Al M <alasdair.mcbr...@gmail.com> wrote:

> I have 2 RDDs I want to join. We will call them RDD A and RDD B. RDD A
> has > 1 billion rows; RDD B has 100k rows. I want to join them on a
> single key.
>
> 95% of the rows in RDD A have the same key to join with RDD B. Before I
> can join the two RDDs, I must map them to tuples where the first element
> is the key and the second is the value.
>
> Since 95% of the rows in RDD A have the same key, they now go into the
> same partition. When I perform the join, the system will try to execute
> this partition in just one task. This one task will try to load too much
> data into memory at once and die a horrible death.
>
> I know that this is caused by the HashPartitioner that is used by
> default in Spark; everything with the same hashed key goes into the same
> partition. I also tried the RangePartitioner but still saw 95% of the
> data go into the same partition. What I'd really like is a partitioner
> that puts everything with the same key into the same partition *except*
> when the partition is over a certain size, then it would just spill into
> the next partition.
>
> Writing my own partitioner is a big job, and requires a lot of testing
> to make sure I get it right. Is there a simpler way to solve this?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-produces-one-huge-partition-tp23358.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
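For what it's worth, the behaviour described above can be reproduced outside Spark. Below is a minimal plain-Python sketch (no Spark required; the key names and row counts are made up for illustration) of why a hot key defeats hash partitioning: every row with the same key hashes to the same partition, regardless of how many partitions you request, which is why simply increasing the partition count does not break up the skew.

```python
from collections import Counter

def hash_partition(key, num_partitions):
    # Mirrors the idea behind Spark's HashPartitioner:
    # partition index = hash(key) mod numPartitions
    return hash(key) % num_partitions

# 95% of rows share one hot key; the rest are spread over many keys.
rows = [("hot", i) for i in range(9500)] + [(f"k{i}", i) for i in range(500)]

num_partitions = 200
sizes = Counter(hash_partition(k, num_partitions) for k, _ in rows)

# All 9500 "hot" rows land in a single partition; raising
# num_partitions does not change that, which is exactly the
# one-huge-partition problem described in the message above.
hot_partition = hash_partition("hot", num_partitions)
print(f"largest partition holds {sizes[hot_partition]} of {len(rows)} rows")
```

This is also why RangePartitioner didn't help: any partitioner that keeps all rows of one key together must put the hot key's 95% of rows in one partition.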