Hi,

Depending on the data distribution, the hash codes generated by key.hashCode() can produce a large skew in the data handed to the reducers: one reducer may receive a very large dataset while the others get small ones, so the whole job waits until the busiest reducer finishes.
Is there a way to split the partition files based on the size of each partition? (A minimal sketch of the custom-Partitioner approach Harish suggested appears below the quoted thread.)

Thanks!
Amol.

> Thanks,
>
> I will try what you suggested.
>
> Best,
>
> On Wed, Sep 16, 2009 at 2:59 AM, Harish Mallipeddi <
> [email protected]> wrote:
>
>> On Wed, Sep 16, 2009 at 12:54 PM, Anh Nguyen <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I am having some trouble with distributing workload evenly to
>> > reducers.
>> >
>> > I have 25 reducers and I intentionally created 25 different Map
>> > output keys so that each output set will go to one Reducer.
>> >
>> > But in practice, some Reducers get 2 sets and some do not get
>> > anything.
>> >
>> > I wonder if there is a way to fix this. Perhaps a custom Map output
>> > class?
>> >
>> > Any help is greatly appreciated.
>>
>> The default HashPartitioner does this:
>> (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
>>
>> So there's no guarantee your 25 different map-output keys would in
>> fact end up in different partitions.
>>
>> Btw if you want some custom partitioning behavior, just implement the
>> Partitioner interface in your custom Partitioner class and supply
>> that to Hadoop (via JobConf.setPartitionerClass).
>>
>> --
>> Harish Mallipeddi
>> http://blog.poundbang.in
>
> --
> ----------------------------
> Anh Nguyen
> http://www.im-nguyen.com
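For reference, here is a minimal sketch of the kind of custom Partitioner Harish describes, using the old mapred API (implement the Partitioner interface, then register it with JobConf.setPartitionerClass). The class name FixedKeyPartitioner and the key naming scheme (a two-digit suffix, e.g. "key00" .. "key24") are assumptions made purely for illustration; the point is that an explicit key-to-partition mapping sidesteps hashCode() collisions modulo numReduceTasks.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch only: routes each distinct key to its own reducer, assuming
// (hypothetically) that keys carry a two-digit id, "key00" .. "key24".
public class FixedKeyPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No per-job configuration needed for this sketch.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // The default HashPartitioner computes:
        //   (key.hashCode() & Integer.MAX_VALUE) % numPartitions
        // which gives no guarantee that 25 distinct keys land in 25
        // distinct partitions. Parsing an explicit id does guarantee it,
        // provided numPartitions >= the number of distinct ids.
        String s = key.toString();
        int id = Integer.parseInt(s.substring(s.length() - 2));
        return id % numPartitions;
    }
}

Wired up in the driver with the old JobConf API:

    conf.setNumReduceTasks(25);
    conf.setPartitionerClass(FixedKeyPartitioner.class);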
