Re: How to handle imbalanced data in hadoop ?

Pankil Doshi Wed, 18 Nov 2009 14:53:27 -0800

Ya thats true. Though it depends on my cluster configuration but still other
reducers (0 to 8) also have 100000 keys to handle in which keys are
different and they get done in (1 min 30 sec on avg).
But 9th reducer getting all 100000 keys same takes 17 mins.


Pankil

On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <runping...@gmail.com> wrote:

> Is it true that most of the 17 minutes for the reducer with the 100000 same
> keys was taken by the sort phase? If so, that means that the sorting
> algorithm does not handle the special case well.
>
>
> On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <forpan...@gmail.com>
> wrote:
>
> > Hey  Todd,
> >
> > I will attach dataset and java source used by me. Make sure you use with
> 10
> > reducers and also use partitioner class as I have provided.
> >
> > Dataset-1 has smaller key length
> > Dataset-2 has larger key length
> >
> > When I experiment with both dataset , According my partitioner class
> > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it
> take
> > maximum amount of time in all reducers.( like 17 mins) where as remaining
> > all reducers also get 100000 keys but they all are not same , and these
> > reducers get over in (1 min 30 sec on avg).
> >
> > Pankil
> >
> >
> > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <forpan...@gmail.com>
> >> wrote:
> >>
> >> > With respect to Imbalanced data, Can anyone guide me how sorting takes
> >> > place
> >> > in Hadoop after Map phase.
> >> >
> >> > I did some experiments and found that if there are two reducers which
> >> have
> >> > same number of keys to sort and one reducer has all the keys same and
> >> other
> >> > have different keys then time taken by by the reducer having all keys
> >> same
> >> > is terribly large then other one.
> >> >
> >> >
> >> Hi Pankil,
> >>
> >> This is an interesting experiment you've done with results that I
> wouldn't
> >> quite expect. Do you have the java source available that you used to run
> >> this experiment?
> >>
> >>
> >>
> >> > Also I found that length on my Key doesnt matter in the time taken to
> >> sort
> >> > it.
> >> >
> >> >
> >> With small keys on CPU-bound workload this is probably the case since
> the
> >> sort would be dominated by comparison. If you were to benchmark keys
> that
> >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
> difference.
> >>
> >>
> >> > I wanted some hints how sorting is done ..
> >> >
> >>
> >> MapTask.java, ReduceTask.java, and Merger.java are the key places to
> look.
> >> The actual sort is a relatively basic quicksort, but there is plenty of
> >> complexity in the spill/shuffle/merge logic.
> >>
> >> -Todd
> >>
> >>
> >>
> >> >
> >> > Pankil
> >> >
> >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> ham...@cloudera.com
> >> > >wrote:
> >> >
> >> > > Hey Jeff,
> >> > >
> >> > > You may be interested in the Skewed Design specification from the
> Pig
> >> > team:
> >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >> > >
> >> > > Regards,
> >> > > Jeff
> >> > >
> >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > My first thought is that it depends on the reduce logic. If you
> >> could
> >> > do
> >> > > > the
> >> > > > reduction in two passes then you could do an initial arbitrary
> >> > partition
> >> > > > for
> >> > > > the majority key and bring the partitions together in a second
> >> > reduction
> >> > > > (or
> >> > > > a map-side join). I would use a round robin strategy to assign the
> >> > > > arbitrary
> >> > > > partitions.
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com>
> >> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > Today there's a problem about imbalanced data come out of mind .
> >> > > > >
> >> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
> >> key
> >> > > > > dominates the map output, say 99%. So 99% data set will go to
> one
> >> > > > reducer,
> >> > > > > and this reducer will become the bottleneck.
> >> > > > >
> >> > > > > Does hadoop have any other better ways to handle such imbalanced
> >> data
> >> > > set
> >> > > > ?
> >> > > > >
> >> > > > >
> >> > > > > Jeff Zhang
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: How to handle imbalanced data in hadoop ?

Reply via email to