Hi Pankil,

Thanks for sending these along. I'll try to block out some time this week to take a look.
-Todd

On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <forpan...@gmail.com> wrote:
> Hey Todd,
>
> I will attach the dataset and the java source I used. Make sure you run it
> with 10 reducers and use the partitioner class I have provided.
>
> Dataset-1 has a smaller key length
> Dataset-2 has a larger key length
>
> When I experiment with both datasets, according to my partitioner class
> Reducer 9 (i.e. 10 if counting from 1) gets 100000 keys that are all the
> same, so it takes the longest of all the reducers (about 17 mins). The
> remaining reducers also get 100000 keys each, but those keys are not all
> the same, and they finish in about 1 min 30 sec on average.
>
> Pankil
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <forpan...@gmail.com> wrote:
>>
>> > With respect to imbalanced data, can anyone guide me on how sorting
>> > takes place in Hadoop after the map phase?
>> >
>> > I did some experiments and found that if two reducers have the same
>> > number of keys to sort, but one reducer's keys are all identical and
>> > the other's keys are all different, then the time taken by the reducer
>> > with identical keys is far larger than the other one.
>> >
>>
>> Hi Pankil,
>>
>> This is an interesting experiment you've done, with results that I
>> wouldn't quite expect. Do you have the java source available that you
>> used to run this experiment?
>>
>> > Also, I found that the length of my key doesn't matter for the time
>> > taken to sort it.
>> >
>>
>> With small keys on a CPU-bound workload this is probably the case, since
>> the sort would be dominated by comparison. If you were to benchmark keys
>> that are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
>> difference.
>>
>> > I wanted some hints on how sorting is done.
>> >
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to
>> look. The actual sort is a relatively basic quicksort, but there is
>> plenty of complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com> wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Join design specification from
>> > > the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> > > > could do the reduction in two passes, then you could do an initial
>> > > > arbitrary partition for the majority key and bring the partitions
>> > > > together in a second reduction (or a map-side join). I would use a
>> > > > round-robin strategy to assign the arbitrary partitions.
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today a problem about imbalanced data came to mind.
>> > > > >
>> > > > > I'd like to know how Hadoop handles this kind of data, e.g. when
>> > > > > one key dominates the map output, say 99%. Then 99% of the data
>> > > > > set will go to one reducer, and this reducer will become the
>> > > > > bottleneck.
>> > > > >
>> > > > > Does Hadoop have any better way to handle such an imbalanced
>> > > > > data set?
>> > > > >
>> > > > > Jeff Zhang
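For reference, a minimal sketch of a key-hash Partitioner along the lines Pankil describes (hypothetical class name and key/value types, using the org.apache.hadoop.mapreduce API; this is not the attached source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route each key to a reducer by its hash. Every occurrence of a
// given key lands on the same reducer, so with 10 reducers a dominant key
// pins one reducer (e.g. reducer 9) with all of its records while the
// other reducers finish quickly.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Wired in with job.setPartitionerClass(HashKeyPartitioner.class) and job.setNumReduceTasks(10), this reproduces the skew Pankil observes whenever one key value dominates the input.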
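Todd's point about comparison cost can be seen in how the map-side sort compares serialized key bytes. A rough sketch of such a raw comparator for Text keys (essentially what Hadoop's built-in Text comparator does; shown only to illustrate where key length enters the cost):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Sketch: the sort compares keys in serialized form, byte by byte. Longer
// keys mean more bytes per comparison, and keys that are all equal force
// each comparison to walk the full key length before returning 0.
public class TextBytesComparator extends WritableComparator {
    protected TextBytesComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Skip the vint length prefix that Text serialization writes.
        int n1 = WritableUtils.decodeVIntSize(b1[s1]);
        int n2 = WritableUtils.decodeVIntSize(b2[s2]);
        return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
}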
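And a minimal sketch of the two-pass idea brien suggests: spread the majority key across reducers with a round-robin salt in the first job, then strip the salt and combine the partial results in a second job or a map-side join (hypothetical key name, salt count, and word-count-style values; assumes the reduce logic can be applied in two passes, e.g. a sum):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the first-pass mapper: append a rotating salt to the dominant
// key so its records spread over all reducers instead of landing on one.
// A second pass maps "hotkey#3" back to "hotkey" and merges the partial
// aggregates produced by the first pass.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String HOT_KEY = "hotkey"; // hypothetical majority key
    private static final int NUM_SALTS = 10;        // e.g. one salt per reducer
    private static final IntWritable ONE = new IntWritable(1);
    private int nextSalt = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().trim();
        if (HOT_KEY.equals(key)) {
            // Round-robin: hotkey#0, hotkey#1, ..., hotkey#9
            key = key + "#" + (nextSalt++ % NUM_SALTS);
        }
        context.write(new Text(key), ONE);
    }
}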