Re: Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Per Jacobsson
> If that's true, then can I set the number of Reducers very high
> (even equal to the number of maps) to make Job C go faster?

This page has some good info on finding the right number of reducers:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
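
For a rough feel of the numbers: the rule of thumb usually quoted in the Hadoop documentation of this era is 0.95 or 1.75 times (number of nodes * reduce slots per node). The sketch below just does that arithmetic. The function name and the example cluster size are made up for illustration, and in a real job the result would be passed to JobConf.setNumReduceTasks.

```python
def suggested_reducers(nodes, reduce_slots_per_node, factor=0.95):
    """Rule-of-thumb reducer count (hypothetical helper, not a Hadoop API).

    factor=0.95 aims for all reduces to start in a single wave;
    factor=1.75 gives faster nodes a second wave for better load balancing.
    """
    total_slots = nodes * reduce_slots_per_node
    return max(1, round(factor * total_slots))


# Assumed example cluster: 10 nodes with 2 reduce slots each.
single_wave = suggested_reducers(10, 2, factor=0.95)
two_waves = suggested_reducers(10, 2, factor=1.75)
print(single_wave, two_waves)
```

Either value is far below "one reducer per map" on most jobs, since each extra reducer adds startup and shuffle overhead.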
/ Per

On Fri, Sep 19, 2008 at 9:42 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:

> > So here's my question -- does Hadoop guarantee that all records with
> > the same key will end up in the same Reducer task?  If that's true,
>
> yes -- think of the record as being sent to the task by hashing over the key
>
> Miles
>
> 2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> > Hi all,
> > The short version of my question is in the subject.  Here's the long
> > version:
> > I have two map/reduce jobs that output records using a common key:
> >
> > Job A:
> > K1  =>  A1,1
> > K1  =>  A1,2
> > K2  =>  A2,1
> > K2  =>  A2,2
> >
> > Job B:
> > K1  =>  B1
> > K2  =>  B2
> > K3  =>  B3
> >
> > And a third job that merges records with the same key, using
> > IdentityMapper and a custom Reducer:
> >
> > Job C:
> > K1  =>  A1,1; A1,2; B1
> > K2  =>  A2,1; A2,2; B2
> > K3  =>  B3
> >
> > The trouble is, the A's and B's are large (20-30 KB each) and I have a
> > few million of them.  If Job C has only one Reducer task, it takes
> > forever to copy and sort all the records.
> >
> > Thanks for any enlightenment you can provide here,
> > -Stuart
> >
>
>


Re: Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Miles Osborne
> So here's my question -- does Hadoop guarantee that all records with
> the same key will end up in the same Reducer task?  If that's true,

yes -- think of the record as being sent to the task by hashing over the key
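
To make the hashing idea concrete, here is a minimal simulation, not Hadoop code. Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch uses zlib.crc32 as a stand-in hash only because it is deterministic across Python runs. The function names and the sample records are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stand-in for the default hash partitioner: the reducer index
    # depends only on the key and the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers):
    # Group (key, value) pairs by the reducer they would be sent to.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def reducers_per_key(buckets):
    # For each key, collect the set of reducers that received it.
    owners = defaultdict(set)
    for reducer, pairs in buckets.items():
        for key, _ in pairs:
            owners[key].add(reducer)
    return owners


# Sample records shaped like the outputs of Job A and Job B.
records = [("K1", "A1,1"), ("K1", "A1,2"), ("K2", "A2,1"),
           ("K2", "A2,2"), ("K1", "B1"), ("K2", "B2"), ("K3", "B3")]

for n in (1, 2, 4, 7):
    owners = reducers_per_key(shuffle(records, n))
    # Whatever the reducer count, each key lands on exactly one reducer.
    assert all(len(reducer_ids) == 1 for reducer_ids in owners.values())
```

The guarantee holds for any reducer count because the partition is a pure function of the key. A custom Partitioner that looks at anything other than the key would break it.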

Miles

2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> Hi all,
> The short version of my question is in the subject.  Here's the long version:
> I have two map/reduce jobs that output records using a common key:
>
> Job A:
> K1  =>  A1,1
> K1  =>  A1,2
> K2  =>  A2,1
> K2  =>  A2,2
>
> Job B:
> K1  =>  B1
> K2  =>  B2
> K3  =>  B3
>
> And a third job that merges records with the same key, using
> IdentityMapper and a custom Reducer:
>
> Job C:
> K1  =>  A1,1; A1,2; B1
> K2  =>  A2,1; A2,2; B2
> K3  =>  B3
>
> The trouble is, the A's and B's are large (20-30 KB each) and I have a
> few million of them.  If Job C has only one Reducer task, it takes
> forever to copy and sort all the records.
>
> So here's my question -- does Hadoop guarantee that all records with
> the same key will end up in the same Reducer task?  If that's true,
> then can I set the number of Reducers very high (even equal to the
> number of maps) to make Job C go faster?
>
> Thanks for any enlightenment you can provide here,
> -Stuart
>





Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Stuart Sierra
Hi all,
The short version of my question is in the subject.  Here's the long version:
I have two map/reduce jobs that output records using a common key:

Job A:
K1  =>  A1,1
K1  =>  A1,2
K2  =>  A2,1
K2  =>  A2,2

Job B:
K1  =>  B1
K2  =>  B2
K3  =>  B3

And a third job that merges records with the same key, using
IdentityMapper and a custom Reducer:

Job C:
K1  =>  A1,1; A1,2; B1
K2  =>  A2,1; A2,2; B2
K3  =>  B3
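
The merge in Job C can be sketched in a few lines of plain Python. This is only a local simulation of the grouping a custom Reducer would see, not the actual job. The function name and the "; " separator are assumptions taken from the example output above.

```python
from collections import defaultdict


def merge_by_key(*job_outputs):
    # Collect every value emitted under the same key, across all inputs,
    # then join them the way the custom Reducer is described as doing.
    grouped = defaultdict(list)
    for output in job_outputs:
        for key, value in output:
            grouped[key].append(value)
    return {key: "; ".join(values) for key, values in sorted(grouped.items())}


# Outputs of Job A and Job B as (key, value) pairs.
job_a = [("K1", "A1,1"), ("K1", "A1,2"), ("K2", "A2,1"), ("K2", "A2,2")]
job_b = [("K1", "B1"), ("K2", "B2"), ("K3", "B3")]

merged = merge_by_key(job_a, job_b)
for key, values in merged.items():
    print(key, "=>", values)
```

In a real job the order of values within one key is not guaranteed unless a secondary sort is set up, so the A's will not necessarily arrive before the B.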

The trouble is, the A's and B's are large (20-30 KB each) and I have a
few million of them.  If Job C has only one Reducer task, it takes
forever to copy and sort all the records.

So here's my question -- does Hadoop guarantee that all records with
the same key will end up in the same Reducer task?  If that's true,
then can I set the number of Reducers very high (even equal to the
number of maps) to make Job C go faster?

Thanks for any enlightenment you can provide here,
-Stuart