Hi,

Depending on the data distribution, the hash codes produced by key.hashCode()
can result in a large skew in the data handed to the reducers: one reducer
might get a very large dataset while the other reducers get small ones, so
the whole job has to wait until the busiest reducer finishes.
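
For illustration, here is how the default assignment works, per the
HashPartitioner formula in Harish's reply below (the key names are made up;
whether any of them collide depends on their actual hash values):

    // Illustration only: default partition assignment. With skewed or
    // colliding keys, some partitions end up much bigger than others.
    int numReduceTasks = 25;
    for (String key : new String[] { "user_42", "user_43", "user_44" }) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        System.out.println(key + " -> partition " + partition);
    }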

Is there a way to split the partition files based on the size of each
partition?
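
In case it helps frame the question: a custom Partitioner along the lines
Harish describes below seems like one possible workaround. Here is a rough,
untested sketch (the class name, the Text/IntWritable types, and the idea of
a precomputed heavy-key table are my assumptions, not anything Hadoop
provides):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SizeAwarePartitioner implements Partitioner<Text, IntWritable> {
        // Hypothetical table of known-heavy keys -> dedicated partition,
        // which would have to be built from a prior sampling pass.
        private Map<String, Integer> heavyKeys = new HashMap<String, Integer>();

        public void configure(JobConf job) {
            // e.g. load the precomputed key -> partition mapping here
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            Integer fixed = heavyKeys.get(key.toString());
            if (fixed != null) {
                return fixed.intValue() % numPartitions;
            }
            // default HashPartitioner behaviour for all other keys
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would then be registered via conf.setPartitionerClass(SizeAwarePartitioner.class),
as Harish notes below. But that still requires knowing the key sizes up front,
hence my question about splitting by partition size.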

Thanks!
Amol.


> Thanks,
>
> I will try what you suggested.
>
> Best,
>
> On Wed, Sep 16, 2009 at 2:59 AM, Harish Mallipeddi
> <[email protected]> wrote:
>
>> On Wed, Sep 16, 2009 at 12:54 PM, Anh Nguyen <[email protected]>
>> wrote:
>>
>> > Hi all,
>> >
>> > I am having some trouble distributing the workload evenly to my
>> > reducers.
>> >
>> > I have 25 reducers, and I intentionally created 25 different Map output
>> > keys so that each output set would go to one Reducer.
>> >
>> > But in practice, some Reducers get 2 sets and some do not get
>> > anything.
>> >
>> > I wonder if there is a way to fix this. Perhaps a custom Map output
>> > class?
>> >
>> > Any help is greatly appreciated.
>> >
>> >
>> The default HashPartitioner does this:
>>   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
>>
>> So there's no guarantee your 25 different map-output keys would in fact
>> end up in different partitions.
>> Btw if you want some custom partitioning behavior, just implement the
>> Partitioner interface in your custom Partitioner class and supply that
>> to Hadoop (via JobConf.setPartitionerClass).
>>
>>
>> --
>> Harish Mallipeddi
>> http://blog.poundbang.in
>>
>
>
>
> --
> ----------------------------
> Anh Nguyen
> http://www.im-nguyen.com
>

