OK, I'll be a bit more specific. Suppose map outputs 100 different keys.
Consider a key "K" whose corresponding values may be spread across N different datanodes, and consider the datanode "D" that holds the maximum number of those values. Instead of moving the values on "D" to other systems, it would be better to bring the values from the other datanodes to "D", minimizing both data movement and delay. The same applies to all the other keys. How does the scheduler take care of this?

2009/8/21 zjffdu <[email protected]>

> Some details to add:
>
> 1. The number of maps is determined by the block size and the InputFormat
> (whether or not you want to split).
>
> 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> Capacity Scheduler are the other two options as far as I know. The
> JobTracker hosts the scheduler.
>
> 3. Once a map task is done, it tells its own tasktracker, and the
> tasktracker tells the jobtracker. So the jobtracker manages all the
> tasks, and it decides how and when to start the reduce tasks.
>
>
> -----Original Message-----
> From: Arun C Murthy [mailto:[email protected]]
> Sent: 2009年8月20日 11:41
> To: [email protected]
> Subject: Re: MR job scheduler
>
>
> On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
>
> > Hi all,
> >
> > Can anyone tell me how the MR scheduler schedules MR jobs?
> > How does it decide where to create map tasks and how many to create?
> > Once the map tasks are over, how does it decide to move the keys to the
> > reducers efficiently (minimizing data movement across the network)?
> > Is there any doc available that describes this scheduling process?
>
> The #maps is decided by the application. The scheduler decides where
> to execute them.
>
> Once the map is done, the reduce tasks connect to the tasktracker (on
> the node where the map task executed) and copy the entire output
> over HTTP.
>
> Arun
>
>
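To make point 1 above concrete, here is a minimal sketch of how the number of map tasks falls out of the input size and the split size. The formula mirrors the one used by Hadoop's FileInputFormat (max(minSize, min(maxSize, blockSize))); the class, the helper names, and the 64 MB / 1 GB numbers are illustrative assumptions, not actual Hadoop code.

```java
// Sketch: how the number of map tasks follows from input splits.
// The split-size formula follows FileInputFormat's
// max(minSize, min(maxSize, blockSize)); everything else here is
// an illustrative stand-in, not Hadoop source.
public class SplitCount {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        // One split per splitSize bytes, rounding up for the tail.
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block (old default)
        long fileSize  = 1024L * 1024 * 1024; // 1 GB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // With these assumptions the job gets one map per block: 16 maps.
        System.out.println(numSplits(fileSize, splitSize));
    }
}
```

So with default min/max settings the split size collapses to the block size, which is why "#maps is determined by the block size" in practice; a custom InputFormat can refuse to split and produce one map per file instead.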
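On point 2, the scheduler is pluggable on the JobTracker. As a sketch for 0.20-era releases (property names have changed across versions, so treat this as an assumption to verify against your release's docs), switching from FIFO to the Fair Scheduler is a mapred-site.xml setting:

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

Note that none of these schedulers does the reduce-side locality optimization asked about above: reducers are placed where slots are free, and each reducer pulls its partition of every map's output over HTTP regardless of where most of a key's values happen to sit.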
