Let me rephrase:

1. The copy phase starts after reducer initialization, which happens before all
maps have completed.
2. Which mapper holds the maximum number of values for a particular key won't be
known until all mappers have completed (to be more precise, until a certain
percentage of the running mappers has completed, since until then we only have
the "current" maximum-value mapper).
Also, there is no rule which says one record can go to only one reducer.
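For reference, which reducer a given key's values go to is decided by the job's
partitioner. Below is a minimal plain-Java sketch (no Hadoop dependency; the
class name is mine) mirroring the logic of Hadoop's default HashPartitioner:

```java
// Minimal sketch of the logic Hadoop's default HashPartitioner uses to
// decide which reduce task a key is sent to. Plain Java, no Hadoop needed.
public class PartitionSketch {

    // Mirrors HashPartitioner.getPartition: mask off the sign bit of the
    // key's hash code, then take it modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        String[] keys = {"apple", "banana", "apple", "cherry"};
        for (String k : keys) {
            // The same key always maps to the same reducer.
            System.out.println(k + " -> reducer " + getPartition(k, reducers));
        }
    }
}
```

With this default, every occurrence of a key lands on the same reducer, but a
custom Partitioner is free to distribute records differently.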

Thanks,
Amogh

-----Original Message-----
From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com] 
Sent: Friday, August 21, 2009 12:12 PM
To: common-user@hadoop.apache.org
Subject: Re: MR job scheduler

Yes. My doubt is about how the location of the reducer is selected. Is it
chosen arbitrarily, or is it placed on a particular machine which already
has the most values (corresponding to the key of that reducer), which would
reduce the cost of transferring data across the network (because many values
for that key are already on the machine where the map phase completed)?

2009/8/21 Amogh Vasekar <am...@yahoo-inc.com>

> Yes, but the copy phase starts with the initialization of a reducer, after
> which it keeps polling for completed map tasks to fetch their respective
> outputs.
>
> -----Original Message-----
> From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com]
> Sent: Friday, August 21, 2009 12:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: MR job scheduler
>
> Amogh
>
> I think the reduce phase starts only when all the map tasks are completed,
> because it needs all the values corresponding to a particular key!
>
> 2009/8/21 Amogh Vasekar <am...@yahoo-inc.com>
>
> > I'm not sure that is the case with Hadoop. I think it assigns a reduce
> > task to an available tasktracker at any instant, since a reducer polls the
> > JobTracker for completed maps. If it were as you said, a reducer wouldn't
> > be initialized until all maps have completed, after which the copy phase
> > would start.
> >
> > Thanks,
> > Amogh
> >
> > -----Original Message-----
> > From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com]
> > Sent: Friday, August 21, 2009 9:50 AM
> > To: common-user@hadoop.apache.org
> > Subject: Re: MR job scheduler
> >
> > OK, I'll be a bit more specific.
> >
> > Suppose a map outputs 100 different keys.
> >
> > Consider a key "K" whose corresponding values may be on N different
> > datanodes, and a datanode "D" which has the maximum number of those values.
> > Instead of moving the values on "D" to other systems, it is useful to bring
> > the values from the other datanodes to "D", to minimize the data movement
> > and also the delay. The same holds for all the other keys. How does the
> > scheduler take care of this?
> > 2009/8/21 zjffdu <zjf...@gmail.com>
> >
> > > Add some details:
> > >
> > > 1. The number of maps is determined by the block size and the InputFormat
> > > (whether or not you want the input to be split).
> > >
> > > 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> > > Capacity Scheduler are the other two options as far as I know. The
> > > JobTracker runs the scheduler.
> > >
> > > 3. Once a map task is done, it tells its own tasktracker, and the
> > > tasktracker tells the jobtracker; so the jobtracker manages all the tasks
> > > and decides how and when to start the reduce tasks.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Arun C Murthy [mailto:a...@yahoo-inc.com]
> > > Sent: August 20, 2009 11:41
> > > To: common-user@hadoop.apache.org
> > > Subject: Re: MR job scheduler
> > >
> > >
> > > On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
> > >
> > > > Hi all,
> > > >
> > > > Can anyone tell me how the MR scheduler schedules MR jobs?
> > > > How does it decide where to create map tasks, and how many to create?
> > > > Once the map tasks are over, how does it decide to move the keys to the
> > > > reducers efficiently (minimizing data movement across the network)?
> > > > Is there any doc available which describes this scheduling process in
> > > > detail?
> > > >
> > >
> > > The number of maps is decided by the application. The scheduler decides
> > > where to execute them.
> > >
> > > Once a map is done, the reduce tasks connect to the tasktracker (on
> > > the node where the map task executed) and copy its entire output
> > > over HTTP.
> > >
> > > Arun
> > >
> > >
> >
>
