Hi Xinh,

Thanks! A custom partitioner with partitionBy() did the job.
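For reference, here is a minimal sketch of what such a partition function might look like in Python. It is not necessarily the code used in this thread. It assumes each element is first keyed by its position in the RDD (for example via zipWithIndex()), and the helper name make_even_partition_func is hypothetical. The runnable part is plain Python that only checks the resulting partition sizes. The PySpark calls appear in comments and assume an existing SparkContext named sc.

```python
from collections import Counter


def make_even_partition_func(num_items, num_partitions):
    """Return a function mapping an element index (0 .. num_items - 1)
    to a partition id, giving contiguous blocks whose sizes differ by
    at most one. (Hypothetical helper, for illustration only.)"""
    def partition_func(index):
        # Integer arithmetic: index 0..4999 -> 0, 5000..9999 -> 1, ...
        return index * num_partitions // num_items
    return partition_func


# Check the distribution for the numbers from the thread.
part = make_even_partition_func(50000, 10)
counts = Counter(part(i) for i in range(50000))
sizes = [counts[p] for p in range(10)]
print(sizes)  # each of the 10 partitions receives 5000 indices

# Sketch of the PySpark side (assumes a SparkContext `sc`; not run here):
#   rdd = sc.parallelize(range(50000), 10)
#   keyed = rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
#   balanced = keyed.partitionBy(10, make_even_partition_func(50000, 10)).values()
#   balanced.glom().map(len).collect()
```

partitionBy() only works on key-value RDDs, which is why the elements are keyed by index first and the keys are dropped again with values(). Like coalesce with shuffle, this moves data between partitions, so it can be expensive on a large RDD.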

On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:

> Hi Ayman,
>
> Have you looked at this:
> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>
> It recommends defining a custom partitioner and (PairRDD) partitionBy
> method to accomplish this.
>
> Xinh
>
> On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil <aymkhali...@gmail.com>
> wrote:
>
>> And btw, I'm using the Python API if this makes any difference.
>>
>> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil <aymkhali...@gmail.com>
>> wrote:
>>
>>> Hi Don,
>>>
>>> This didn't help. My original rdd is already created using 10
>>> partitions. As a matter of fact, after trying with rdd.coalesce(10,
>>> shuffle = true) out of curiosity, the rdd partitions became even more
>>> imbalanced:
>>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096),
>>> (6, 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>>
>>>
>>> On Tue, May 10, 2016 at 10:16 PM, Don Drake <dondr...@gmail.com> wrote:
>>>
>>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd
>>>> will be evenly balanced. This obviously triggers a shuffle, so be advised
>>>> it could be an expensive operation depending on your RDD size.
>>>>
>>>> -Don
>>>>
>>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil <aymkhali...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have 50,000 items parallelized into an RDD with 10 partitions, I
>>>>> would like to evenly split the items over the partitions so:
>>>>> 50,000/10 = 5,000 in each RDD partition.
>>>>>
>>>>> What I get instead is the following (partition index, partition count):
>>>>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>>>>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>>>>
>>>>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the
>>>>> partitions are imbalanced.
>>>>>
>>>>> Is there a way to do that?
>>>>>
>>>>> Thank you,
>>>>> Ayman
>>>>>
>>>>
>>>> --
>>>> Donald Drake
>>>> Drake Consulting
>>>> http://www.drakeconsulting.com/
>>>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>>>> 800-733-2143