Re: How to handle imbalanced data in hadoop ?

Jeff Zhang Tue, 24 Nov 2009 08:10:21 -0800

agree, the worst case for quick sort is O(n2) if all the data is the same.

and in this case the running for heapsort is O(n*log n)


Jeff Zhang


On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <ted.xu...@gmail.com> wrote:

> 2009/11/18 Pankil Doshi <forpan...@gmail.com>
>
> > With respect to Imbalanced data, Can anyone guide me how sorting takes
> > place
> > in Hadoop after Map phase.
> >
> > I did some experiments and found that if there are two reducers which
> have
> > same number of keys to sort and one reducer has all the keys same and
> other
> > have different keys then time taken by by the reducer having all keys
> same
> > is terribly large then other one.
> >
>
> Maybe that is caused by the sorting algorithm.
>
> It is make sence if we consider the default sort algorithm is quick sort.
> In quick sort if
> all the sorting data is quite the same, sort iteration will be very deep.
>
> Would you please try changing the job configuration "map.sort.class" to be
> org.apache.hadoop.util.HeapSort?
>
>
> > Also I found that length on my Key doesnt matter in the time taken to
> sort
> > it.
> >
> > I wanted some hints how sorting is done ..
> >
> > Pankil
> >
> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com
> > >wrote:
> >
> > > Hey Jeff,
> > >
> > > You may be interested in the Skewed Design specification from the Pig
> > team:
> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > >
> > > Regards,
> > > Jeff
> > >
> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com>
> > wrote:
> > >
> > > > My first thought is that it depends on the reduce logic. If you could
> > do
> > > > the
> > > > reduction in two passes then you could do an initial arbitrary
> > partition
> > > > for
> > > > the majority key and bring the partitions together in a second
> > reduction
> > > > (or
> > > > a map-side join). I would use a round robin strategy to assign the
> > > > arbitrary
> > > > partitions.
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com>
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Today there's a problem about imbalanced data come out of mind .
> > > > >
> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > > dominates the map output, say 99%. So 99% data set will go to one
> > > > reducer,
> > > > > and this reducer will become the bottleneck.
> > > > >
> > > > > Does hadoop have any other better ways to handle such imbalanced
> data
> > > set
> > > > ?
> > > > >
> > > > >
> > > > > Jeff Zhang
> > > > >
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Reply via email to