A better solution is to output a large number of records with keys drawn pseudo-randomly from a large domain of values. The records will then balance across however many reducers you have, and because you are balancing many records rather than just a few, the imbalance that occurs at random will be much smaller.
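As a concrete illustration (a minimal sketch, not code from this thread: the class name and salt scheme are assumptions, and the real job would need a follow-up step that strips the salt and re-aggregates the partial results), a mapper using the old mapred API might salt its keys like this:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SaltingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // A salt domain much larger than the reducer count keeps the random
  // imbalance small once many records are spread across it.
  private static final int NUM_SALTS = 1000;
  private static final IntWritable ONE = new IntWritable(1);
  private final Random random = new Random();
  private final Text saltedKey = new Text();

  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Draw a salt pseudo-randomly and append it to the logical key, so
    // records that share a key still scatter across reducers.
    int salt = random.nextInt(NUM_SALTS);
    saltedKey.set(value.toString() + "#" + salt);
    output.collect(saltedKey, ONE);
  }
}

A second job (or the reducer itself, if the aggregation is associative) then strips the "#salt" suffix and combines the partial results.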
The original poster might not have that option in his problem as stated, but it sounded like he might be able to restructure the computation a bit.

On Wed, Sep 16, 2009 at 4:08 AM, <[email protected]> wrote:

> Hi,
>
> Depending on the data distribution, the hash code generated by
> key.hashCode() can produce a large skew in the data handed to the
> reducers, so one reducer may get a very large data set while the others
> get small ones. This makes the job wait until the busiest reducer
> finishes.
>
> Is there a way to split the partition files based on the size of each
> partition?
>
> Thanks!
> Amol.
>
> > Thanks,
> >
> > I will try what you suggested.
> >
> > Best,
> >
> > On Wed, Sep 16, 2009 at 2:59 AM, Harish Mallipeddi <
> > [email protected]> wrote:
> >
> >> On Wed, Sep 16, 2009 at 12:54 PM, Anh Nguyen <[email protected]>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I am having some trouble with distributing workload evenly to
> >> > reducers.
> >> >
> >> > I have 25 reducers and I intentionally created 25 different map
> >> > output keys so that each output set would go to one reducer.
> >> >
> >> > But in practice, some reducers get two sets and some do not get
> >> > anything.
> >> >
> >> > I wonder if there is a way to fix this. Perhaps a custom map output
> >> > class?
> >> >
> >> > Any help is greatly appreciated.
> >>
> >> The default HashPartitioner does this:
> >> (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
> >>
> >> So there's no guarantee that your 25 different map-output keys would
> >> in fact end up in different partitions.
> >> Btw, if you want custom partitioning behavior, just implement the
> >> Partitioner interface in your own Partitioner class and supply it to
> >> Hadoop (via JobConf.setPartitionerClass).
> >>
> >> --
> >> Harish Mallipeddi
> >> http://blog.poundbang.in
> >
> > --
> > ----------------------------
> > Anh Nguyen
> > http://www.im-nguyen.com

--
Ted Dunning, CTO
DeepDyve
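For what it's worth, the custom Partitioner that Harish describes could be as small as the following sketch (the class name and the assumption that the 25 keys are the strings "0" through "24" are illustrative, not from the thread):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class DirectPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // Nothing to configure in this sketch.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Assumes the map-output keys are the strings "0" .. "24"; with 25
    // reducers, each key then lands in its own partition, one-to-one.
    return Integer.parseInt(key.toString()) % numPartitions;
  }
}

It is wired into the job with conf.setPartitionerClass(DirectPartitioner.class), as Harish notes.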
