Re: Do all Mapper outputs with same key go to same Reducer?
> If that's true, then can I set the number of Reducers very high
> (even equal to the number of maps) to make Job C go faster?

This page has some good info on finding the right number of reducers:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

/ Per

On Fri, Sep 19, 2008 at 9:42 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> > So here's my question -- does Hadoop guarantee that all records with
> > the same key will end up in the same Reducer task?
>
> yes -- think of the record as being sent to the task by hashing over the key
>
> Miles
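For a rough feel of the numbers involved, here is a minimal sketch of the rule of thumb given in the Hadoop MapReduce tutorial: about 0.95 or 1.75 times (number of nodes * reduce slots per node). Whether the wiki page above gives exactly the same figures is an assumption on my part, and the class name, method name, and cluster sizes below are all hypothetical.

```java
// Hypothetical helper, not part of Hadoop: estimates a reducer count from
// the "0.95 or 1.75 times (nodes * reduce slots per node)" rule of thumb.
final class ReducerCountSketch {

    // factorPercent is 95 or 175; integer math avoids floating-point rounding.
    static int suggestedReducers(int nodes, int reduceSlotsPerNode, int factorPercent) {
        return (nodes * reduceSlotsPerNode * factorPercent) / 100;
    }

    public static void main(String[] args) {
        int nodes = 10;             // assumed cluster size
        int reduceSlotsPerNode = 2; // assumed mapred.tasktracker.reduce.tasks.maximum

        // 0.95: all reduces can start at once, in a single wave.
        System.out.println("one wave:  " + suggestedReducers(nodes, reduceSlotsPerNode, 95));
        // 1.75: faster nodes pick up a second wave, which balances load better.
        System.out.println("two waves: " + suggestedReducers(nodes, reduceSlotsPerNode, 175));
    }
}
```

In a real job the resulting number would be passed to `JobConf.setNumReduceTasks(int)`. Setting it as high as the number of maps is allowed, but each extra reducer adds task startup and shuffle overhead, so a count tied to the available reduce slots is usually the safer starting point.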
Re: Do all Mapper outputs with same key go to same Reducer?
2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> So here's my question -- does Hadoop guarantee that all records with
> the same key will end up in the same Reducer task?

yes -- think of the record as being sent to the task by hashing over the key

Miles
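To make the hashing remark concrete, here is a small standalone sketch. It uses the same arithmetic as Hadoop's default HashPartitioner (clear the sign bit of the key's hash code, then take the remainder by the number of reduce tasks), but it is plain Java with String keys rather than Hadoop Writables, so the actual partition numbers in a real job will differ. The class name and the reducer count of 4 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone illustration only; a real job would use Hadoop's own Partitioner.
final class HashPartitionSketch {

    // Same formula as the default HashPartitioner: a non-negative hash
    // modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // hypothetical reducer count

        // Records shaped like the outputs of Job A and Job B in the question.
        String[][] records = {
            {"K1", "A1,1"}, {"K1", "A1,2"}, {"K2", "A2,1"}, {"K2", "A2,2"},
            {"K1", "B1"}, {"K2", "B2"}, {"K3", "B3"},
        };

        // Group records by the reducer they would be routed to.
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String[] record : records) {
            int partition = partitionFor(record[0], numReduceTasks);
            byReducer.computeIfAbsent(partition, p -> new ArrayList<>())
                     .add(record[0] + " => " + record[1]);
        }

        byReducer.forEach((partition, recs) ->
            System.out.println("reducer " + partition + ": " + recs));
    }
}
```

Because the partition depends only on the key and the reducer count, every record for K1 lands on the same reducer no matter which map task emitted it. Several different keys can still share one reducer, so the guarantee is "same key, same reducer", not "one key per reducer".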
Do all Mapper outputs with same key go to same Reducer?
Hi all,

The short version of my question is in the subject. Here's the long version:

I have two map/reduce jobs that output records using a common key:

Job A:
K1 => A1,1
K1 => A1,2
K2 => A2,1
K2 => A2,2

Job B:
K1 => B1
K2 => B2
K3 => B3

And a third job that merges records with the same key, using
IdentityMapper and a custom Reducer:

Job C:
K1 => A1,1; A1,2; B1
K2 => A2,1; A2,2; B2
K3 => B3

The trouble is, the A's and B's are large (20-30 KB each) and I have a
few million of them. If Job C has only one Reducer task, it takes
forever to copy and sort all the records.

So here's my question -- does Hadoop guarantee that all records with
the same key will end up in the same Reducer task? If that's true,
then can I set the number of Reducers very high (even equal to the
number of maps) to make Job C go faster?

Thanks for any enlightenment you can provide here,
-Stuart