Hi,
This is the time for all three phases of the reducer, right?
I think it's due to the constant spilling of a single key to disk, since the
map partitions couldn't be held in memory due to the buffer limit. Did the
other reducer have numerous keys with a low number of values (i.e., smaller
partitions)?

Thanks,
Amogh


On 11/18/09 3:37 AM, "Todd Lipcon" <t...@cloudera.com> wrote:

On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <forpan...@gmail.com> wrote:

> With respect to imbalanced data, can anyone guide me on how sorting takes
> place in Hadoop after the map phase?
>
> I did some experiments and found that if there are two reducers which have
> the same number of keys to sort, and one reducer has all keys the same
> while the other has different keys, then the time taken by the reducer
> having all keys the same is terribly large compared to the other one.
>
>
Hi Pankil,

This is an interesting experiment you've done with results that I wouldn't
quite expect. Do you have the java source available that you used to run
this experiment?



> Also, I found that the length of my key doesn't matter in the time taken
> to sort it.
>
>
With small keys on a CPU-bound workload this is probably the case, since the
sort would be dominated by comparisons. If you were to benchmark keys that
are 10 bytes vs. keys that are 1000 bytes, I'm sure you'd see a difference.
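Todd's point about key length can be checked with a quick microbenchmark (a
sketch, not anything from the Hadoop source; `makeKeys` and the key shapes
are illustrative assumptions). The length difference shows up most clearly
when long keys share a common prefix, so each comparison has to scan deep
into the key:

```java
import java.util.Arrays;
import java.util.Random;

public class KeyLengthSortBench {
    // Hypothetical helper: n keys that share a common prefix of (len - 8)
    // characters, so comparisons must scan deep into long keys before
    // finding a differing character.
    static String[] makeKeys(int n, int len, long seed) {
        Random r = new Random(seed);
        char[] prefix = new char[Math.max(0, len - 8)];
        Arrays.fill(prefix, 'x');
        String shared = new String(prefix);
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = shared + String.format("%08d", r.nextInt(1_000_000));
        }
        return keys;
    }

    static long timeSortMillis(String[] keys) {
        String[] copy = keys.clone();
        long t0 = System.nanoTime();
        Arrays.sort(copy); // comparison-based, like the map-side in-memory sort
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 200_000;
        System.out.println("10-byte keys:   " + timeSortMillis(makeKeys(n, 10, 42)) + " ms");
        System.out.println("1000-byte keys: " + timeSortMillis(makeKeys(n, 1000, 42)) + " ms");
    }
}
```

With short or highly distinct keys the gap shrinks, because comparisons
terminate at the first differing byte.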


> I wanted some hints on how sorting is done.
>

MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
The actual sort is a relatively basic quicksort, but there is plenty of
complexity in the spill/shuffle/merge logic.
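The in-memory sort compares keys in their serialized byte form. A minimal
standalone sketch of that comparison contract (similar in spirit to Hadoop's
`WritableComparator.compareBytes`, but written here without any Hadoop
dependency) looks like this:

```java
import java.util.Arrays;

public class RawKeyCompare {
    // Lexicographic comparison of unsigned bytes: the first differing byte
    // decides; if one key is a prefix of the other, the shorter sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[][] keys = {
            "banana".getBytes(), "apple".getBytes(), "cherry".getBytes()
        };
        Arrays.sort(keys, RawKeyCompare::compareBytes);
        for (byte[] k : keys) System.out.println(new String(k));
        // prints: apple, banana, cherry
    }
}
```

Comparing raw bytes avoids deserializing each key just to order it, which is
why the comparator cost (and hence key length) matters in the sort.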

-Todd



>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com>
> wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> team:
> > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com>
> > wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could
> > > do the reduction in two passes, then you could do an initial arbitrary
> > > partition for the majority key and bring the partitions together in a
> > > second reduction (or a map-side join). I would use a round-robin
> > > strategy to assign the arbitrary partitions.
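The two-pass idea above is often called key "salting". A minimal standalone
sketch (class and method names are illustrative, not from Hadoop) for a
word-count-style job: the first pass spreads the hot key over several
synthetic keys round-robin, and the second pass strips the salt and merges
the partial counts:

```java
import java.util.Map;
import java.util.TreeMap;

public class SaltedPartition {
    static final int NUM_SALTS = 4; // assumed number of arbitrary partitions

    // First pass: spread occurrences of a hot key over NUM_SALTS synthetic
    // keys ("hot#0" .. "hot#3") round-robin, so no single reducer sees them all.
    static Map<String, Long> firstPass(String hotKey, int occurrences) {
        Map<String, Long> partial = new TreeMap<>();
        for (int i = 0; i < occurrences; i++) {
            partial.merge(hotKey + "#" + (i % NUM_SALTS), 1L, Long::sum);
        }
        return partial;
    }

    // Second pass: strip the salt and combine the partial counts.
    static Map<String, Long> secondPass(Map<String, Long> partial) {
        Map<String, Long> merged = new TreeMap<>();
        partial.forEach((k, v) ->
            merged.merge(k.substring(0, k.indexOf('#')), v, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> partial = firstPass("hot", 10);
        System.out.println(partial);             // {hot#0=3, hot#1=3, hot#2=2, hot#3=2}
        System.out.println(secondPass(partial)); // {hot=10}
    }
}
```

This only works when the reduce function can be computed from partial
results (e.g. sums, counts, max), which is the "depends on the reduce logic"
caveat.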
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today a problem about imbalanced data came to mind.
> > > >
> > > > I'd like to know how Hadoop handles this kind of data, e.g. one key
> > > > dominates the map output, say 99%. So 99% of the data set will go to
> > > > one reducer, and this reducer will become the bottleneck.
> > > >
> > > > Does Hadoop have any better way to handle such an imbalanced data
> > > > set?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>
