> If that's true, then can I set the number of Reducers very high
> (even equal to the number of maps) to make Job C go faster?
This page has some good info on finding the right number of reducers:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

/ Per

On Fri, Sep 19, 2008 at 9:42 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> > So here's my question -- does Hadoop guarantee that all records with
> > the same key will end up in the same Reducer task?  If that's true,
>
> yes -- think of the record as being sent to the task by hashing over the key
>
> Miles
>
> 2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> > Hi all,
> > The short version of my question is in the subject.  Here's the long
> > version:
> >
> > I have two map/reduce jobs that output records using a common key:
> >
> > Job A:
> > K1  =>  A1,1
> > K1  =>  A1,2
> > K2  =>  A2,1
> > K2  =>  A2,2
> >
> > Job B:
> > K1  =>  B1
> > K2  =>  B2
> > K3  =>  B3
> >
> > And a third job that merges records with the same key, using
> > IdentityMapper and a custom Reducer:
> >
> > Job C:
> > K1  =>  A1,1; A1,2; B1
> > K2  =>  A2,1; A2,2; B2
> > K3  =>  B3
> >
> > The trouble is, the A's and B's are large (20-30 KB each) and I have a
> > few million of them.  If Job C has only one Reducer task, it takes
> > forever to copy and sort all the records.
> >
> > Thanks for any enlightenment you can provide here,
> > -Stuart
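
To illustrate Miles's point about hashing over the key, here is a minimal sketch in plain Python rather than Hadoop code. The names `partition` and `shuffle` are made up for this example, and `zlib.crc32` is only a stand-in for whatever hash the framework applies to the key. It replays the Job A and Job B records from Stuart's message and shows that, for a fixed number of reducers, every record with the same key lands on the same reducer:

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index from the key alone (hypothetical helper).

    crc32 is an assumed stand-in hash; the point is only that the
    result depends on nothing but the key and the reducer count.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers):
    """Route (key, value) pairs to reducers and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# Records as described in the thread (outputs of Job A and Job B).
job_a = [("K1", "A1,1"), ("K1", "A1,2"), ("K2", "A2,1"), ("K2", "A2,2")]
job_b = [("K1", "B1"), ("K2", "B2"), ("K3", "B3")]

# Job C with several reducers instead of one.
reducers = shuffle(job_a + job_b, num_reducers=4)

for index, groups in enumerate(reducers):
    for key, values in sorted(groups.items()):
        print(f"reducer {index}: {key} => {'; '.join(values)}")
```

Because the reducer index is a function of the key only, raising the reducer count spreads the keys over more tasks without splitting any single key's records. How many reducers is actually worthwhile is a separate tuning question, which is what the wiki page linked above addresses.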