I don't disagree, but as I noted above, most of the slowdown Pankil saw came from an inefficient reducer that was extremely slow for large value sets under a single key. Once I fixed that issue, the sort times for the "all the same" and "9 different values" reducers were within the noise margin of each other.
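To make the distinction concrete, the per-key pathology was of the usual quadratic kind. This is only an illustrative sketch, not the actual benchmark code (the class names and the string-joining logic are placeholders), using the 0.20 mapred API:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Slow: each += recopies everything accumulated so far, so a key with
    // n values costs O(n^2) work. One hot key then makes the whole reduce
    // phase look like a "sort" problem when the sort is actually fine.
    // (Both classes are package-private only so the sketch fits one file.)
    class SlowJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String joined = "";
        while (values.hasNext()) {
          joined += values.next().toString() + ",";  // O(n^2) total copying
        }
        output.collect(key, new Text(joined));
      }
    }

    // Fixed: a StringBuilder keeps the accumulation linear in n.
    class FastJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
          joined.append(values.next().toString()).append(',');
        }
        output.collect(key, new Text(joined.toString()));
      }
    }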
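On Ted's map.sort.class suggestion below: for anyone who wants to try it, it is a one-line change to the job configuration. A minimal sketch (the driver class name is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class HeapSortJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HeapSortJobDriver.class);
        // Replace the default map-side spill sort (QuickSort) with HeapSort,
        // which stays O(n log n) even when every key compares equal.
        conf.set("map.sort.class", "org.apache.hadoop.util.HeapSort");
        // ... set mapper, reducer, and input/output paths as usual,
        // then submit the job.
      }
    }

If the driver goes through ToolRunner, the same setting can be passed on the command line as -D map.sort.class=org.apache.hadoop.util.HeapSort.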
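And on brien's two-pass round-robin idea below: the first pass amounts to a custom partitioner that scatters only the dominant key. A hedged sketch, with the hot key hard-coded purely for illustration (in practice it would come from sampling or job configuration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HotKeyRoundRobinPartitioner
        implements Partitioner<Text, Text> {
      // Placeholder: the dominant key, assumed known in advance here.
      private static final String HOT_KEY = "the-dominant-key";
      private int next = 0;  // per-map-task counter driving the round robin

      public void configure(JobConf job) {}

      public int getPartition(Text key, Text value, int numPartitions) {
        if (key.toString().equals(HOT_KEY)) {
          // Scatter the hot key evenly instead of sending 99% of the
          // data to a single reducer.
          next = (next + 1) % numPartitions;
          return next;
        }
        // All other keys keep the usual hash partitioning.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The catch, as brien notes, is that this deliberately breaks the all-values-for-a-key-on-one-reducer guarantee for the hot key, so the partial results must be combined in a second reduction or a map-side join.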
-Todd

On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> Agreed. The worst case for quicksort is O(n^2) if all the data is the
> same, and in this case the running time for heapsort is O(n log n).
>
> Jeff Zhang
>
>
> On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <ted.xu...@gmail.com> wrote:
>
> > 2009/11/18 Pankil Doshi <forpan...@gmail.com>
> >
> > > With respect to imbalanced data, can anyone guide me on how sorting
> > > takes place in Hadoop after the map phase?
> > >
> > > I did some experiments and found that if two reducers have the same
> > > number of keys to sort, and one reducer has all its keys the same
> > > while the other has different keys, then the time taken by the
> > > reducer with all keys the same is far larger than the other.
> >
> > Maybe that is caused by the sorting algorithm.
> >
> > It makes sense if we consider that the default sort algorithm is
> > quicksort. In quicksort, if all the data being sorted is nearly the
> > same, the recursion will go very deep.
> >
> > Would you please try changing the job configuration "map.sort.class"
> > to org.apache.hadoop.util.HeapSort?
> >
> > > Also, I found that the length of my key doesn't matter in the time
> > > taken to sort it.
> > >
> > > I wanted some hints on how sorting is done.
> > >
> > > Pankil
> > >
> > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com> wrote:
> > >
> > > > Hey Jeff,
> > > >
> > > > You may be interested in the Skewed Join design specification from
> > > > the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec
> > > >
> > > > Regards,
> > > > Jeff
> > > >
> > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
> > > >
> > > > > My first thought is that it depends on the reduce logic. If you
> > > > > could do the reduction in two passes, then you could do an
> > > > > initial arbitrary partition for the majority key and bring the
> > > > > partitions together in a second reduction (or a map-side join).
> > > > > I would use a round-robin strategy to assign the arbitrary
> > > > > partitions.
> > > > >
> > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Today a problem with imbalanced data came to mind.
> > > > > >
> > > > > > I'd like to know how Hadoop handles this kind of data, e.g.
> > > > > > when one key dominates the map output, say 99%. Then 99% of
> > > > > > the data set will go to one reducer, and that reducer will
> > > > > > become the bottleneck.
> > > > > >
> > > > > > Does Hadoop have any better way to handle such an imbalanced
> > > > > > data set?
> > > > > >
> > > > > > Jeff Zhang