A better solution is to output a large number of records with keys drawn
pseudo-randomly from a large domain of values.  The records will then balance
across however many reducers you have.  Because you are balancing a large
number of records rather than a handful of coarse key groups, the relative
imbalance that happens at random will be much smaller.
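
To make that concrete, here is a rough sketch of a salting mapper (not
tested; the class name, salt size, and key-extraction logic are all made up,
and it assumes the old org.apache.hadoop.mapred API):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Spreads records over a large key domain by appending a pseudo-random
// salt to each logical key, so the partitioner sees many distinct keys
// and the random imbalance across reducers stays proportionally small.
public class SaltingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Use a domain much larger than the number of reducers (made-up value).
  private static final int SALT_RANGE = 10000;
  private final Random random = new Random();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Key extraction is application-specific; shown here as the first
    // tab-separated field of the input line.
    String logicalKey = line.toString().split("\t", 2)[0];
    String saltedKey = logicalKey + "#" + random.nextInt(SALT_RANGE);
    output.collect(new Text(saltedKey), line);
  }
}

The catch is that records for the same logical key now land on different
reducers, so you need a second pass (or reducer-side logic) to re-aggregate
by the real key.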

The original poster might not have that option in his problem as stated, but
it sounded like he might be able to restructure the computation a bit.
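
And for the simpler 25-key case, the custom Partitioner that Harish suggests
below is only a few lines.  A sketch against the old mapred API (the class
name and the "<index>:<payload>" key format are my invention):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each key to an explicitly chosen reducer by parsing a reducer
// index out of the key, rather than trusting key.hashCode() to spread
// 25 keys evenly over 25 partitions.
public class IndexedKeyPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // nothing to configure
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    // Assumes keys look like "<index>:<payload>", e.g. "17:some-data".
    int index = Integer.parseInt(key.toString().split(":", 2)[0]);
    return index % numPartitions;
  }
}

You would supply it via conf.setPartitionerClass(IndexedKeyPartitioner.class),
as Harish notes.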

On Wed, Sep 16, 2009 at 4:08 AM, <[email protected]> wrote:

> Hi,
>
> Based on the data distribution, the hash codes generated by key.hashCode()
> can result in a large skew in the data handed to the reducers. One reducer
> might get a very large dataset while the others get small ones, making the
> whole job wait until the busiest reducer finishes.
>
> Is there a way to split the partition files based on the size of each
> partition?
>
> Thanks!
> Amol.
>
>
> > Thanks,
> >
> > I will try what you suggested.
> >
> > Best,
> >
> > On Wed, Sep 16, 2009 at 2:59 AM, Harish Mallipeddi <
> > [email protected]> wrote:
> >
> >> On Wed, Sep 16, 2009 at 12:54 PM, Anh Nguyen <[email protected]
> >> >wrote:
> >>
> >> > Hi all,
> >> >
> >> > I am having some trouble with distributing workload evenly to
> >> > reducers.
> >> >
> >> > I have 25 reducers and I intentionally created 25 different Map output
> >> > keys so that each output set will go to one Reducer.
> >> >
> >> > But in practice, some Reducers get 2 sets and some do not get
> >> > anything.
> >> >
> >> > I wonder if there is a way to fix this. Perhaps a custom Map output
> >> > class?
> >> >
> >> > Any help is greatly appreciated.
> >> >
> >> >
> >> The default HashPartitioner does this:
> >> (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
> >>
> >> So there's no guarantee your 25 different map-output keys would in fact
> >> end up in different partitions.
> >> Btw if you want some custom partitioning behavior, just implement the
> >> Partitioner interface in your custom Partitioner class and supply that
> >> to Hadoop (via JobConf.setPartitionerClass).
> >>
> >>
> >> --
> >> Harish Mallipeddi
> >> http://blog.poundbang.in
> >>
> >
> >
> >
> > --
> > ----------------------------
> > Anh Nguyen
> > http://www.im-nguyen.com
> >
>
>


-- 
Ted Dunning, CTO
DeepDyve
