On Mar 10, 2011, at 10:34 AM, Jeffrey Buell wrote:
> Rita said that she has 2 racks (not 2 nodes). Rita, how many nodes per rack
> do you have?
>
> To continue the thread, could there be a performance advantage to having
> greater replication in the shuffle or reduce phases? That is, is hadoop
> smart enough that when it needs data that are not on the local node, it finds
> out which copy of that data is on the closest (in the network sense) node and
> gets it from there?

The reduce phase doesn't read from HDFS, so HDFS replication doesn't come into play there. Each reduce does the equivalent of an HTTP GET from the tasktracker that holds the map's intermediate output. The possible speedup here is for the reduce to get scheduled on a node where one of the job's map tasks ran, especially a host that holds a significant share of the map output, so that part of the fetch stays local. This could potentially reduce network usage, but in the end the saving is likely to be insignificant.
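
To put a rough number on "insignificant", here is a back-of-the-envelope sketch in Python. It is not Hadoop code: the function names, host names, and byte counts are made up for illustration, and it assumes a reduce pulls the same proportion of every map's output, which real partitioning does not guarantee.

```python
def best_reduce_host(map_output_bytes):
    """Return the host holding the most map output (hypothetical helper)."""
    return max(map_output_bytes, key=map_output_bytes.get)


def local_shuffle_fraction(map_output_bytes, reduce_host):
    """Fraction of the shuffle input that is already on reduce_host.

    Assumes the reduce fetches the same proportion of every map's output,
    so its local share equals that host's share of the total map output.
    """
    total = sum(map_output_bytes.values())
    if total == 0:
        return 0.0
    return map_output_bytes.get(reduce_host, 0) / total


# Illustrative numbers only: 20 hosts, each holding 1000 MB of map output.
even = {"node%02d" % i: 1000 for i in range(20)}
even_fraction = local_shuffle_fraction(even, best_reduce_host(even))

# Illustrative skewed case: one host holds most of the map output.
skewed = {"a": 6000, "b": 2000, "c": 2000}
skewed_host = best_reduce_host(skewed)
skewed_fraction = local_shuffle_fraction(skewed, skewed_host)
```

Under those assumptions, with map output spread evenly over 20 hosts, even the best placement keeps only 1/20 of the shuffle traffic off the network; the rest still has to be fetched over HTTP from the other tasktrackers. Placement only starts to matter when the output is heavily skewed toward a few hosts, as in the second case.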