Hey Jeff,

You may be interested in the Skewed Join design specification from the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec
Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
> My first thought is that it depends on the reduce logic. If you could do
> the reduction in two passes, then you could do an initial arbitrary
> partition for the majority key and bring the partitions together in a
> second reduction (or a map-side join). I would use a round-robin strategy
> to assign the arbitrary partitions.
>
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>> Hi all,
>>
>> Today a problem about imbalanced data came to mind.
>>
>> I'd like to know how Hadoop handles this kind of data. For example, say
>> one key dominates the map output at 99%. Then 99% of the data set will
>> go to one reducer, and that reducer becomes the bottleneck.
>>
>> Does Hadoop have any better way to handle such an imbalanced data set?
>>
>> Jeff Zhang
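For anyone following the thread: brien's two-pass idea — salt the dominant key round-robin so it spreads across several reducers, then merge the partial results in a second pass — can be sketched in plain Java. This is only an illustration of the technique, not a Hadoop API; the class name `SaltedAggregation` and the `#` salt separator are made up for the example, and a real job would put pass 1 in a Partitioner/Reducer and pass 2 in a follow-up job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SaltedAggregation {

    // Pass 1: count per key, but rewrite the hot key round-robin into
    // r salted sub-keys ("hot#0" .. "hot#(r-1)") so no single reducer
    // would receive 99% of the records.
    static Map<String, Long> pass1(List<String> keys, String hotKey, int r) {
        Map<String, Long> partial = new HashMap<>();
        int next = 0; // round-robin counter for the hot key
        for (String k : keys) {
            String salted = k.equals(hotKey) ? k + "#" + (next++ % r) : k;
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Pass 2: strip the salt and merge the partial counts back into
    // one total per original key.
    static Map<String, Long> pass2(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String original = e.getKey().split("#", 2)[0];
            result.merge(original, e.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // 99 records share one key, 1 record has a rare key.
        List<String> keys = new ArrayList<>(Collections.nCopies(99, "hot"));
        keys.add("rare");

        Map<String, Long> partial = pass1(keys, "hot", 4);
        // The hot key is now split into 4 salted keys of ~25 records each.
        System.out.println("salted keys: " + partial.size());

        Map<String, Long> merged = pass2(partial);
        System.out.println("hot=" + merged.get("hot") + " rare=" + merged.get("rare"));
    }
}
```

The same salting trick works for joins as long as the second pass (or a map-side join against the small side) can reassemble the sub-partitions, which is essentially what the Pig skewed-join spec linked above formalizes.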