Re: How to handle imbalanced data in hadoop ?

Pankil Doshi Tue, 17 Nov 2009 13:55:14 -0800

With respect to imbalanced data, can anyone explain how sorting takes place
in Hadoop after the map phase?


I ran some experiments and found that if two reducers have the same number
of keys to sort, but one reducer's keys are all identical and the other's
are all different, then the reducer with the identical keys takes far longer
than the other one.

I also found that the length of my key doesn't affect the time taken to
sort it.

I would appreciate some hints on how the sorting is done.

Pankil

On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com> wrote:

> Hey Jeff,
>
> You may be interested in the Skewed Join design specification from the Pig
> team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
>
> Regards,
> Jeff
>
> On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
>
> > My first thought is that it depends on the reduce logic. If you could do
> > the reduction in two passes then you could do an initial arbitrary
> > partition for the majority key and bring the partitions together in a
> > second reduction (or a map-side join). I would use a round-robin strategy
> > to assign the arbitrary partitions.
> >
> >
> >
> >
> > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Today a problem about imbalanced data came to mind.
> > >
> > > I'd like to know how Hadoop handles this kind of data, e.g. one key
> > > dominates the map output, say 99%. Then 99% of the data set will go to
> > > one reducer, and this reducer will become the bottleneck.
> > >
> > > Does Hadoop have any better ways to handle such an imbalanced data
> > > set?
> > >
> > >
> > > Jeff Zhang
> > >
> >
>
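
The two-pass idea quoted above (spread the majority key over arbitrary
partitions round robin, then bring the partial results together in a second
reduction) can be sketched in plain Python. This is only a simulation, not
Hadoop API code: NUM_REDUCERS, HOT_KEY, partition, first_pass and second_pass
are hypothetical names, the hot key is assumed to be known in advance, and the
reduce logic is assumed to be a simple count, which can safely be merged from
partial results.

```python
from collections import Counter
from itertools import count

NUM_REDUCERS = 4   # hypothetical number of reducers
HOT_KEY = "hot"    # hypothetical majority key, assumed known in advance


def partition(key, num_reducers=NUM_REDUCERS):
    """Stand-in for a hash partitioner: a given key always maps to one reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def first_pass(records):
    """Pass 1: count keys per reducer, spreading the hot key round robin."""
    round_robin = count()
    reducers = [Counter() for _ in range(NUM_REDUCERS)]
    for key in records:
        if key == HOT_KEY:
            # Arbitrary partition for the majority key, assigned round robin.
            target = next(round_robin) % NUM_REDUCERS
        else:
            # Every other key keeps the usual hash-based assignment.
            target = partition(key)
        reducers[target][key] += 1
    return reducers


def second_pass(partials):
    """Pass 2: merge the per-reducer partial counts into final totals."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds counts together
    return totals


if __name__ == "__main__":
    records = [HOT_KEY] * 99 + ["cold"]
    partials = first_pass(records)
    print("records per reducer:", [sum(p.values()) for p in partials])
    print("final counts:", dict(second_pass(partials)))
```

This only works when the reduce function can be applied to partial results
and then merged (sums, counts, min/max and the like); as brien notes, whether
that holds depends on the reduce logic.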
