Hi Mu,

Small job overhead is something that has been worked on a bit in recent versions, but here's the gist of it (as best I know, though I don't work much in this area of the code):
- The JobTracker doesn't assign tasks forcefully to TaskTrackers. Instead, the TaskTrackers send heartbeats at a certain interval (MRConstants.HEARTBEAT_INTERVAL_MIN). The minimum interval is once every 3 seconds. For every 100 nodes above 300, that interval increases by one second (MRConstants.CLUSTER_INCREMENT).
- Because of this, each task from the JobTracker can take up to 3 seconds to get assigned to a TaskTracker.
- I believe that the TaskTrackers also do not report Task Completion Events except as part of a heartbeat. This means that after each task finishes, there can be another 3-second delay before the JobTracker finds out about it.
- Though these things seem inefficient, the reasoning is that, in a large cluster of, say, 1000 nodes, the TTs could potentially overwhelm the JobTracker if the heartbeats were more frequent. With more nodes, the amount of time between a task becoming pending and a TT reporting a heartbeat is also likely to be small. Additionally, MapReduce is designed in general for large jobs, where the time spent processing a task significantly eclipses the scheduling time.

Given all of these delays, plus various amounts of time spent copying your job JAR to and from HDFS, even an "empty" job can take many seconds. Around 20 sounds about right from my experience.

Hope that helps,
-Todd

On Sun, Jul 12, 2009 at 9:52 PM, Mu Qiao <qiao...@gmail.com> wrote:
> Hi, everyone
>
> I've tested the Hadoop environment I've set up. I noticed that it takes 24s
> to run a 2-mapper, 1-reducer job with empty input.
> Is that a reasonable time for a do-nothing job? Why does it take so much time?
>
> Thanks
>
> --
> Best wishes,
> Qiao Mu
> MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University
> Department of Computer Science and Technology, Xi'an Jiaotong University
> TEL: 15991676983
> E-mail: qiao...@gmail.com
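P.S. The interval scaling Todd describes (a 3-second floor, plus one second per 100 nodes once the cluster grows past 300) can be sketched roughly like this. Note this is a minimal illustration of the described behavior, not the actual JobTracker code; the class and method names here are made up for the example, and only the two MRConstants values are taken from the text above.

```java
public class HeartbeatIntervalSketch {
    // Values as described above: 3-second minimum interval,
    // and the interval grows in steps of 100 nodes.
    static final int HEARTBEAT_INTERVAL_MIN_MS = 3 * 1000;
    static final int CLUSTER_INCREMENT = 100;

    // Returns the heartbeat interval in milliseconds for a cluster
    // with the given number of TaskTrackers: one second per 100 nodes,
    // never less than the 3-second minimum.
    static int nextHeartbeatInterval(int clusterSize) {
        int scaledMs =
            (int) (1000 * Math.ceil((double) clusterSize / CLUSTER_INCREMENT));
        return Math.max(scaledMs, HEARTBEAT_INTERVAL_MIN_MS);
    }

    public static void main(String[] args) {
        // Small clusters stay at the 3-second floor.
        System.out.println(nextHeartbeatInterval(10));    // 3000
        System.out.println(nextHeartbeatInterval(300));   // 3000
        // 400 nodes = 100 above 300, so one extra second.
        System.out.println(nextHeartbeatInterval(400));   // 4000
        System.out.println(nextHeartbeatInterval(1000));  // 10000
    }
}
```

So on a 1000-node cluster a TaskTracker heartbeats every 10 seconds, which is the trade-off Todd mentions: scheduling latency per task goes up, but the JobTracker isn't flooded with RPCs.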