Doug pointed out that chaining reduces like this is a bad idea. The reasoning is that reliability is severely compromised: map output is stored only locally during the sort phase, so a small failure can cause serious problems and entail a large amount of rework.
The preferred implementation for chained reduces is simply to use multiple map/reduce phases, where the map in the later phases is just the identity function (or a field permutation which selects a different key). As I have gained more experience, I am finding that it is very, very common to see a strong decrease in data size as you move down the chain. This is because counting (or something similar) is a very common operation, and counting compresses the heck out of your data. That compression means that what happens in the downstream phases just doesn't matter much. (A rough sketch of the two-job wiring follows the quoted message below.)

On 9/7/07 9:12 AM, "C G" <[EMAIL PROTECTED]> wrote:

> I've seen some traffic where people discuss using multiple reduces, and I'd
> like to understand more about this.
>
> If you do multiple reduces, does that mean from a data flow perspective:
>
> map() -> reduce0() -> reduce1() -> ... -> reduceN-1() -> reduceN() ?
>
> From an implementation point of view, how do you go about setting up
> multiple reduces?
>
> Thanks for any advice or pointers to info...
> C G
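For concreteness, here is a minimal sketch of the two-job setup, written against the old org.apache.hadoop.mapred API. The paths and the TokenCountMapper class are made up for illustration; InverseMapper, IdentityMapper, IdentityReducer, and LongSumReducer are the stock helpers in org.apache.hadoop.mapred.lib.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InverseMapper;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class ChainedReduces {

  // Phase 1 map: emit (token, 1) for every whitespace-separated token.
  public static class TokenCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      for (String tok : line.toString().split("\\s+")) {
        if (tok.length() > 0) out.collect(new Text(tok), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Phase 1: count tokens. The counting reduce typically shrinks the
    // data drastically, which is why the downstream phases are cheap.
    JobConf count = new JobConf(ChainedReduces.class);
    count.setJobName("phase1-count");
    FileInputFormat.setInputPaths(count, new Path("input"));       // hypothetical path
    FileOutputFormat.setOutputPath(count, new Path("counts"));     // hypothetical path
    count.setMapperClass(TokenCountMapper.class);
    count.setReducerClass(LongSumReducer.class);                   // sums the 1s
    count.setOutputKeyClass(Text.class);
    count.setOutputValueClass(LongWritable.class);
    count.setOutputFormat(SequenceFileOutputFormat.class);         // preserve types for phase 2
    JobClient.runJob(count);                                       // blocks until done

    // Phase 2: the "second reduce". Its map is just a field permutation:
    // InverseMapper swaps key and value, re-keying the records by count.
    // Use IdentityMapper instead if the key should stay the same.
    JobConf byCount = new JobConf(ChainedReduces.class);
    byCount.setJobName("phase2-by-count");
    byCount.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(byCount, new Path("counts"));
    FileOutputFormat.setOutputPath(byCount, new Path("by-count")); // hypothetical path
    byCount.setMapperClass(InverseMapper.class);
    byCount.setReducerClass(IdentityReducer.class);
    byCount.setOutputKeyClass(LongWritable.class);
    byCount.setOutputValueClass(Text.class);
    JobClient.runJob(byCount);
  }
}

Because phase 1's counting output is already tiny, phase 2's shuffle and sort cost almost nothing, which is exactly the point made above about downstream phases not mattering much.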