Doug pointed out that chaining reduces like this is a bad idea. The reasoning is that reliability is severely compromised: map output is stored only locally during the sort phase, so a small failure can cause serious problems and entail a large amount of rework.
The preferred implementation for chained reduces is simply to use multiple map/reduce phases, where the map in the later phases is just the identity function (or a field permutation which selects a different key). As I have gained more experience, I am finding that it is very, very common to see a strong decrease in data size as you move down the chain. This is because counting (or something similar) is a very common operation, and counting compresses the heck out of your data. That compression means that what happens in the downstream phases just doesn't matter much. (A rough sketch of the two-job wiring follows the quoted message below.)

On 9/7/07 9:12 AM, "C G" <[EMAIL PROTECTED]> wrote:

> I've seen some traffic where people discuss using multiple reduces, and I'd
> like to understand more about this.
>
> If you do multiple reduces, does that mean from a data flow perspective:
>
> map() -> reduce0() -> reduce1() -> ... -> reduceN-1() -> reduceN() ?
>
> From an implementation point of view, how do you go about setting up
> multiple reduces?
>
> Thanks for any advice or pointers to info...
> C G
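For concreteness, here is a minimal sketch of the two-job setup, written against the old org.apache.hadoop.mapred API. The paths and the TokenCountMapper class are made up for illustration; InverseMapper, IdentityMapper, IdentityReducer, and LongSumReducer are the stock helpers in org.apache.hadoop.mapred.lib.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InverseMapper;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class ChainedReduces {

  // Phase 1 map: emit (token, 1) for every whitespace-separated token.
  public static class TokenCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      for (String tok : line.toString().split("\\s+")) {
        if (tok.length() > 0) out.collect(new Text(tok), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Phase 1: count tokens. The counting reduce typically shrinks the
    // data drastically, which is why the downstream phases are cheap.
    JobConf count = new JobConf(ChainedReduces.class);
    count.setJobName("phase1-count");
    FileInputFormat.setInputPaths(count, new Path("input"));       // hypothetical path
    FileOutputFormat.setOutputPath(count, new Path("counts"));     // hypothetical path
    count.setMapperClass(TokenCountMapper.class);
    count.setReducerClass(LongSumReducer.class);                   // sums the 1s
    count.setOutputKeyClass(Text.class);
    count.setOutputValueClass(LongWritable.class);
    count.setOutputFormat(SequenceFileOutputFormat.class);         // preserve types for phase 2
    JobClient.runJob(count);                                       // blocks until done

    // Phase 2: the "second reduce". Its map is just a field permutation:
    // InverseMapper swaps key and value, re-keying the records by count.
    // Use IdentityMapper instead if the key should stay the same.
    JobConf byCount = new JobConf(ChainedReduces.class);
    byCount.setJobName("phase2-by-count");
    byCount.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(byCount, new Path("counts"));
    FileOutputFormat.setOutputPath(byCount, new Path("by-count")); // hypothetical path
    byCount.setMapperClass(InverseMapper.class);
    byCount.setReducerClass(IdentityReducer.class);
    byCount.setOutputKeyClass(LongWritable.class);
    byCount.setOutputValueClass(Text.class);
    JobClient.runJob(byCount);
  }
}

Because phase 1's counting output is already tiny, phase 2's shuffle and sort cost almost nothing, which is exactly the point made above about downstream phases not mattering much.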