Hello,
I have been profiling the performance of certain parts of Hadoop 0.20.203.0. For this purpose, I have set up a simple cluster with one node acting as the NameNode/JobTracker and one node as the sole DataNode/TaskTracker. In this experiment I run a job consisting of a single map task and a single reduce task, both using the default Mapper/Reducer implementations (the identity functions). The input of the job is a file with a single 256MB block, so the output of the map task is 256MB, and the reduce task must shuffle that 256MB from the local host. To my surprise, shuffling this amount of data takes around 9 seconds, which is excessively slow.

First I turned my attention to the ReduceTask.ReduceOutputCopier. I determined that about 1.1 seconds is spent calculating checksums (the expected value), and the remaining time is spent reading from the stream returned by URLConnection.getInputStream(). Some simple tests with URLConnection could not reproduce the issue unless it was actually reading from the TaskTracker's MapOutputServlet, so the problem seemed to be on the server side; reading the same amount of data from any other local web server takes only 0.2s.

I inserted some measurements into the MapOutputServlet and determined that 0.1s was spent reading the intermediate file (unsurprising, as it was still in the page cache) and 7.7s were spent writing to the stream returned by response.getOutputStream(). The slowdown therefore appears to be in Jetty.

CPU usage during the transfer is low, so it feels like the transfer is being throttled somehow, but if that's the case I can't figure out how that's happening. There's nothing in the source code to suggest Hadoop is deliberately throttling anything, and as far as I know Jetty doesn't throttle by default.

I was also seeing some warnings in the TaskTracker log file related to this: http://wiki.eclipse.org/Jetty/Feature/JVM_NIO_Bug
However, running Hadoop under Java 7 made those warnings disappear and the transfer is still slow, so I don't think that's the cause.

I'm out of ideas as to what could be causing this. Any insights?

Regards,
Sven
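
P.S. For reference, below is roughly the kind of standalone test I meant by "simple tests with URLConnection". The class name and the placeholder mapOutput URL are illustrative only (it assumes the TaskTracker's default 50060 HTTP port); it just drains the response stream and reports the throughput, with no checksumming.

// Standalone test: open the map output URL and time how long it takes
// to drain the response stream into a buffer.
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ShuffleReadTimer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: the real request carries the actual job/map
        // attempt IDs and goes to the TaskTracker's HTTP port.
        String target = args.length > 0 ? args[0]
                : "http://localhost:50060/mapOutput?job=<jobid>&map=<attemptid>&reduce=0";

        URLConnection conn = new URL(target).openConnection();
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        long start = System.nanoTime();
        InputStream in = conn.getInputStream();
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } finally {
            in.close();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("read %d bytes in %.2f s (%.1f MB/s)%n",
                total, seconds, total / 1048576.0 / seconds);
    }
}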