Hi Pankil,

Thanks for sending these along. I'll try to block out some time this week to take a look.
-Todd

On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <forpan...@gmail.com> wrote:
> Hey Todd,
>
> I will attach the dataset and the java source I used. Make sure you run it
> with 10 reducers and use the partitioner class I have provided.
>
> Dataset-1 has a smaller key length
> Dataset-2 has a larger key length
>
> When I experiment with both datasets, according to my partitioner class
> Reducer 9 (i.e. 10 if counting from 1) gets 100000 keys that are all the
> same, so it takes the longest of all the reducers (about 17 mins). The
> remaining reducers also get 100000 keys each, but those keys are not all
> the same, and they finish in about 1 min 30 sec on average.
>
> Pankil
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <forpan...@gmail.com> wrote:
>>
>> > With respect to imbalanced data, can anyone guide me on how sorting
>> > takes place in Hadoop after the map phase?
>> >
>> > I did some experiments and found that if two reducers have the same
>> > number of keys to sort, but one reducer's keys are all identical and
>> > the other's keys are all different, then the time taken by the reducer
>> > with identical keys is far larger than the other one.
>> >
>>
>> Hi Pankil,
>>
>> This is an interesting experiment you've done, with results that I
>> wouldn't quite expect. Do you have the java source available that you
>> used to run this experiment?
>>
>> > Also, I found that the length of my key doesn't matter for the time
>> > taken to sort it.
>> >
>>
>> With small keys on a CPU-bound workload this is probably the case, since
>> the sort would be dominated by comparison. If you were to benchmark keys
>> that are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
>> difference.
>>
>> > I wanted some hints on how sorting is done.
>> >
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to
>> look. The actual sort is a relatively basic quicksort, but there is
>> plenty of complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ham...@cloudera.com> wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Join design specification from
>> > > the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolw...@gmail.com> wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> > > > could do the reduction in two passes, then you could do an initial
>> > > > arbitrary partition for the majority key and bring the partitions
>> > > > together in a second reduction (or a map-side join). I would use a
>> > > > round-robin strategy to assign the arbitrary partitions.
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today a problem about imbalanced data came to mind.
>> > > > >
>> > > > > I'd like to know how Hadoop handles this kind of data, e.g. when
>> > > > > one key dominates the map output, say 99%. Then 99% of the data
>> > > > > set will go to one reducer, and this reducer will become the
>> > > > > bottleneck.
>> > > > >
>> > > > > Does Hadoop have any better way to handle such an imbalanced
>> > > > > data set?
>> > > > >
>> > > > > Jeff Zhang
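For reference, a minimal sketch of a key-hash Partitioner along the lines Pankil describes (hypothetical class name and key/value types, using the org.apache.hadoop.mapreduce API; this is not the attached source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route each key to a reducer by its hash. Every occurrence of a
// given key lands on the same reducer, so with 10 reducers a dominant key
// pins one reducer (e.g. reducer 9) with all of its records while the
// other reducers finish quickly.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Wired in with job.setPartitionerClass(HashKeyPartitioner.class) and job.setNumReduceTasks(10), this reproduces the skew Pankil observes whenever one key value dominates the input.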
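Todd's point about comparison cost can be seen in how the map-side sort compares serialized key bytes. A rough sketch of such a raw comparator for Text keys (essentially what Hadoop's built-in Text comparator does; shown only to illustrate where key length enters the cost):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Sketch: the sort compares keys in serialized form, byte by byte. Longer
// keys mean more bytes per comparison, and keys that are all equal force
// each comparison to walk the full key length before returning 0.
public class TextBytesComparator extends WritableComparator {
    protected TextBytesComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Skip the vint length prefix that Text serialization writes.
        int n1 = WritableUtils.decodeVIntSize(b1[s1]);
        int n2 = WritableUtils.decodeVIntSize(b2[s2]);
        return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
}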
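And a minimal sketch of the two-pass idea brien suggests: spread the majority key across reducers with a round-robin salt in the first job, then strip the salt and combine the partial results in a second job or a map-side join (hypothetical key name, salt count, and word-count-style values; assumes the reduce logic can be applied in two passes, e.g. a sum):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the first-pass mapper: append a rotating salt to the dominant
// key so its records spread over all reducers instead of landing on one.
// A second pass maps "hotkey#3" back to "hotkey" and merges the partial
// aggregates produced by the first pass.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String HOT_KEY = "hotkey"; // hypothetical majority key
    private static final int NUM_SALTS = 10;        // e.g. one salt per reducer
    private static final IntWritable ONE = new IntWritable(1);
    private int nextSalt = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().trim();
        if (HOT_KEY.equals(key)) {
            // Round-robin: hotkey#0, hotkey#1, ..., hotkey#9
            key = key + "#" + (nextSalt++ % NUM_SALTS);
        }
        context.write(new Text(key), ONE);
    }
}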