OK, I'll be a bit more specific. Suppose map outputs 100 different keys.
Consider a key "K" whose corresponding values may be spread across N different datanodes, and consider the datanode "D" that holds the maximum number of those values. Instead of moving the values on "D" to other systems, it would be better to bring the values from the other datanodes to "D", minimizing both data movement and delay. The same applies to all the other keys. How does the scheduler take care of this?

2009/8/21 zjffdu <[email protected]>

> Some details to add:
>
> 1. The number of maps is determined by the block size and the InputFormat
> (whether or not you want to split).
>
> 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> Capacity Scheduler are the other two options as far as I know. The
> JobTracker hosts the scheduler.
>
> 3. Once a map task is done, it tells its own tasktracker, and the
> tasktracker tells the jobtracker. So the jobtracker manages all the
> tasks, and it decides how and when to start the reduce tasks.
>
>
> -----Original Message-----
> From: Arun C Murthy [mailto:[email protected]]
> Sent: 2009年8月20日 11:41
> To: [email protected]
> Subject: Re: MR job scheduler
>
>
> On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
>
> > Hi all,
> >
> > Can anyone tell me how the MR scheduler schedules MR jobs?
> > How does it decide where to create map tasks and how many to create?
> > Once the map tasks are over, how does it decide to move the keys to the
> > reducers efficiently (minimizing data movement across the network)?
> > Is there any doc available that describes this scheduling process?
>
> The #maps is decided by the application. The scheduler decides where
> to execute them.
>
> Once the map is done, the reduce tasks connect to the tasktracker (on
> the node where the map task executed) and copy the entire output
> over HTTP.
>
> Arun
>
>
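To make point 1 above concrete, here is a minimal sketch of how the number of map tasks falls out of the input size and the split size. The formula mirrors the one used by Hadoop's FileInputFormat (max(minSize, min(maxSize, blockSize))); the class, the helper names, and the 64 MB / 1 GB numbers are illustrative assumptions, not actual Hadoop code.

```java
// Sketch: how the number of map tasks follows from input splits.
// The split-size formula follows FileInputFormat's
// max(minSize, min(maxSize, blockSize)); everything else here is
// an illustrative stand-in, not Hadoop source.
public class SplitCount {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        // One split per splitSize bytes, rounding up for the tail.
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block (old default)
        long fileSize  = 1024L * 1024 * 1024; // 1 GB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // With these assumptions the job gets one map per block: 16 maps.
        System.out.println(numSplits(fileSize, splitSize));
    }
}
```

So with default min/max settings the split size collapses to the block size, which is why "#maps is determined by the block size" in practice; a custom InputFormat can refuse to split and produce one map per file instead.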
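On point 2, the scheduler is pluggable on the JobTracker. As a sketch for 0.20-era releases (property names have changed across versions, so treat this as an assumption to verify against your release's docs), switching from FIFO to the Fair Scheduler is a mapred-site.xml setting:

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

Note that none of these schedulers does the reduce-side locality optimization asked about above: reducers are placed where slots are free, and each reducer pulls its partition of every map's output over HTTP regardless of where most of a key's values happen to sit.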
