Hi Xinh,

Thanks! Custom partitioner with partitionBy() did the job.
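
For the archives, here is a minimal sketch of that approach, not necessarily the exact code used. It assumes each item is keyed by its position via zipWithIndex() and that the item count is known up front. The name balanced_partition and its default arguments are made up for illustration. The Spark calls are shown only as comments, so the snippet runs without a cluster and simply checks the resulting partition sizes.

```python
from collections import Counter


def balanced_partition(index, num_items=50000, num_partitions=10):
    """Map a 0-based item index to a partition id in [0, num_partitions).

    Consecutive indices land in the same partition, and partition sizes
    differ by at most one item.
    """
    return index * num_partitions // num_items


# Hypothetical PySpark usage, assuming an existing RDD named `rdd`
# (not executed here):
#   keyed = rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
#   balanced = keyed.partitionBy(10, balanced_partition).values()
#   balanced.glom().map(len).collect()

# Local check of the partition function alone, without Spark.
sizes = Counter(balanced_partition(i) for i in range(50000))
print(sorted(sizes.items()))
```

Note that partitionBy() still shuffles the data, so the same cost caveat as for coalesce(10, shuffle = true) applies.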


On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:

> Hi Ayman,
>
> Have you looked at this:
> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>
> It recommends defining a custom partitioner and (PairRDD) partitionBy
> method to accomplish this.
>
> Xinh
>
> On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil <aymkhali...@gmail.com>
> wrote:
>
>> And btw, I'm using the Python API if this makes any difference.
>>
>> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil <aymkhali...@gmail.com>
>> wrote:
>>
>>> Hi Don,
>>>
>>> This didn't help. My original rdd is already created using 10
>>> partitions. As a matter of fact, after trying with rdd.coalesce(10,
>>> shuffle = true) out of curiosity, the rdd partitions became even more
>>> imbalanced:
>>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096),
>>> (6, 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>>
>>>
>>> On Tue, May 10, 2016 at 10:16 PM, Don Drake <dondr...@gmail.com> wrote:
>>>
>>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd
>>>> will be evenly balanced.  This obviously triggers a shuffle, so be advised
>>>> it could be an expensive operation depending on your RDD size.
>>>>
>>>> -Don
>>>>
>>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil <aymkhali...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have 50,000 items parallelized into an RDD with 10 partitions. I
>>>>> would like to split the items evenly over the partitions, so that
>>>>> there are 50,000/10 = 5,000 in each RDD partition.
>>>>>
>>>>> What I get instead is the following (partition index, partition count):
>>>>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>>>>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>>>>
>>>>> The total is correct (4096 + 4944 + 8*5120 = 50,000), but the
>>>>> partitions are imbalanced.
>>>>>
>>>>> Is there a way to get evenly sized partitions?
>>>>>
>>>>> Thank you,
>>>>> Ayman
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Donald Drake
>>>> Drake Consulting
>>>> http://www.drakeconsulting.com/
>>>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>>>> 800-733-2143
>>>>
>>>
>>>
>>
>
