Hi Ted, I was referring to version 0.20.0. As Todd pointed out, the issue I described was fixed in version 0.20.2.
I only looked at the Cloudera version 0.20.2+228 (http://archive.cloudera.com/cdh/3/), currently in beta. I guess Hadoop 0.20.2 also has the fix; I will take a look at that too.

Thanks,
Abhishek

On Sun, Apr 11, 2010 at 2:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Reading assignTasks() in 0.20.2 reveals that the number of map tasks
> assigned is not limited to 1 per heartbeat.
>
> Cheers
>
> On Sun, Apr 11, 2010 at 12:30 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hi Abhishek,
>>
>> This behavior is improved by MAPREDUCE-706, I believe (not certain that
>> that's the JIRA, but I know it's fixed in the trunk fair scheduler). These
>> patches are included in CDH3 (currently in beta):
>> http://archive.cloudera.com/cdh/3/
>>
>> In general, though, map tasks that short are not going to be very
>> efficient - even with fast assignment there is some constant overhead per
>> task.
>>
>> Thanks
>> -Todd
>>
>> On Sun, Apr 11, 2010 at 11:42 AM, abhishek sharma <absha...@usc.edu>
>> wrote:
>>
>> > Hi all,
>> >
>> > I have been using the Hadoop Fair Scheduler for some experiments on a
>> > 100-node cluster with 2 map slots per node (hence, a total of 200 map
>> > slots).
>> >
>> > In one of my experiments, all the map tasks finish within a heartbeat
>> > interval of 3 seconds. I noticed that the maximum number of
>> > concurrently active map slots on my cluster never exceeds 100, and
>> > hence, cluster utilization during my experiments never exceeds 50%,
>> > even when large jobs with more than 1,000 maps are being executed.
>> >
>> > A look at the Fair Scheduler code (in particular, the assignTasks
>> > function) revealed the reason. As per my understanding, with the
>> > implementation in Hadoop 0.20.0, a TaskTracker is not assigned more
>> > than 1 map and 1 reduce task per heartbeat.
>> >
>> > In my experiments, in every heartbeat, each TaskTracker has 2 free map
>> > slots but is assigned only 1 map task, and hence, utilization never
>> > goes beyond 50%.
>> >
>> > Of course, this (degenerate) case does not arise when map tasks take
>> > more than one heartbeat interval to finish. For example, I repeated
>> > the experiments with map tasks taking close to 15 s to finish and
>> > observed close to 100% utilization when large jobs were executing.
>> >
>> > Why does the Fair Scheduler not assign more than one map task to a TT
>> > per heartbeat? Is this done to spread the load uniformly across the
>> > cluster? I looked at the assignTasks function in the default Hadoop
>> > scheduler (JobQueueTaskScheduler.java), and it does assign more than
>> > 1 map task per heartbeat to a TT.
>> >
>> > It would be easy to change the Fair Scheduler to assign more than 1
>> > map task to a TT per heartbeat (I did that and achieved 100%
>> > utilization even with small map tasks). But I am wondering if doing
>> > so would violate some fairness properties.
>> >
>> > Thanks,
>> > Abhishek
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
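For anyone following along, the utilization effect Abhishek describes can be reproduced with a small standalone simulation. This is hypothetical illustration code, not the actual Hadoop scheduler; the class and method names are invented, and the cap parameter simply models the 0.20.0 Fair Scheduler behavior of handing out at most one map task per TaskTracker heartbeat:

```java
// Toy model of per-heartbeat task assignment (NOT real Hadoop code).
// With 100 nodes x 2 map slots and tasks that finish within one heartbeat,
// a cap of 1 assignment per heartbeat leaves half the slots idle.
public class HeartbeatSim {

    // Number of map tasks one tracker receives in a single heartbeat,
    // given its free slots, the pending map count, and a per-heartbeat cap.
    static int assignTasks(int freeSlots, int pendingMaps, int capPerHeartbeat) {
        return Math.min(freeSlots, Math.min(pendingMaps, capPerHeartbeat));
    }

    public static void main(String[] args) {
        int nodes = 100, slotsPerNode = 2, pendingMaps = 1000;
        int totalSlots = nodes * slotsPerNode;

        // 0.20.0 Fair Scheduler behavior: at most 1 map task per heartbeat.
        int usedCapped = nodes * assignTasks(slotsPerNode, pendingMaps, 1);
        System.out.println("cap=1:  " + (100 * usedCapped / totalSlots) + "% utilization");

        // No cap (as in JobQueueTaskScheduler): fill every free slot.
        int usedUncapped = nodes * assignTasks(slotsPerNode, pendingMaps, Integer.MAX_VALUE);
        System.out.println("no cap: " + (100 * usedUncapped / totalSlots) + "% utilization");
    }
}
```

Running it prints 50% utilization for the capped case and 100% when the cap is lifted, matching the numbers reported in the experiments above.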