Hey Jeff,

You may be interested in the Skewed Join design specification from the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec
Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
> My first thought is that it depends on the reduce logic. If you could do
> the reduction in two passes, then you could do an initial arbitrary
> partition for the majority key and bring the partitions together in a
> second reduction (or a map-side join). I would use a round-robin strategy
> to assign the arbitrary partitions.
>
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>> Hi all,
>>
>> Today a problem about imbalanced data came to mind.
>>
>> I'd like to know how Hadoop handles this kind of data. For example, say
>> one key dominates the map output at 99%. Then 99% of the data set will
>> go to one reducer, and that reducer becomes the bottleneck.
>>
>> Does Hadoop have any better way to handle such an imbalanced data set?
>>
>> Jeff Zhang
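For anyone following the thread: brien's two-pass idea — salt the dominant key round-robin so it spreads across several reducers, then merge the partial results in a second pass — can be sketched in plain Java. This is only an illustration of the technique, not a Hadoop API; the class name `SaltedAggregation` and the `#` salt separator are made up for the example, and a real job would put pass 1 in a Partitioner/Reducer and pass 2 in a follow-up job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SaltedAggregation {

    // Pass 1: count per key, but rewrite the hot key round-robin into
    // r salted sub-keys ("hot#0" .. "hot#(r-1)") so no single reducer
    // would receive 99% of the records.
    static Map<String, Long> pass1(List<String> keys, String hotKey, int r) {
        Map<String, Long> partial = new HashMap<>();
        int next = 0; // round-robin counter for the hot key
        for (String k : keys) {
            String salted = k.equals(hotKey) ? k + "#" + (next++ % r) : k;
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Pass 2: strip the salt and merge the partial counts back into
    // one total per original key.
    static Map<String, Long> pass2(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String original = e.getKey().split("#", 2)[0];
            result.merge(original, e.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // 99 records share one key, 1 record has a rare key.
        List<String> keys = new ArrayList<>(Collections.nCopies(99, "hot"));
        keys.add("rare");

        Map<String, Long> partial = pass1(keys, "hot", 4);
        // The hot key is now split into 4 salted keys of ~25 records each.
        System.out.println("salted keys: " + partial.size());

        Map<String, Long> merged = pass2(partial);
        System.out.println("hot=" + merged.get("hot") + " rare=" + merged.get("rare"));
    }
}
```

The same salting trick works for joins as long as the second pass (or a map-side join against the small side) can reassemble the sub-partitions, which is essentially what the Pig skewed-join spec linked above formalizes.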