On Mar 10, 2011, at 10:34 AM, Jeffrey Buell wrote:

> Rita said that she has 2 racks (not 2 nodes).  Rita, how many nodes per rack 
> do you have?
> 
> To continue the thread, could there be a performance advantage to having 
> greater replication in the shuffle or reduce phases?  That is, is hadoop 
> smart enough that when it needs data that are not on the local node, it finds 
> out which copy of that data is on the closest (in the network sense) node and 
> gets it from there?  

        The reduce phase doesn't read from HDFS.  It does the equivalent of an
HTTP GET from the tasktracker that holds the map's intermediate output.  The
speedup here would come from scheduling the reduce on a node that ran one of
the job's map tasks, especially a host holding a significant amount of map
output.  This could potentially reduce network usage, but in the end the
saving is likely to be insignificant.
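
A back-of-the-envelope sketch in Python of why the saving tends to be small.
This is not Hadoop code: the function names, host names, and sizes are all
made up for illustration, and it assumes map output for one reduce partition
is spread evenly across 20 hosts.

```python
def shuffle_bytes_over_network(map_output_by_host, reducer_host):
    """Bytes one reducer must fetch from hosts other than its own.

    map_output_by_host is a hypothetical dict of host -> bytes of map
    output destined for this reducer's partition.
    """
    return sum(size for host, size in map_output_by_host.items()
               if host != reducer_host)


def best_reducer_host(map_output_by_host):
    """Host holding the most map output for this partition."""
    return max(map_output_by_host, key=map_output_by_host.get)


# Assumed numbers: 20 hosts, each holding 50 MB for this partition.
outputs = {f"node{i:02d}": 50 * 1024 * 1024 for i in range(20)}
total = sum(outputs.values())

local = best_reducer_host(outputs)
remote = shuffle_bytes_over_network(outputs, local)

# Fraction of this reducer's shuffle traffic that stays on the local host.
saved_fraction = 1 - remote / total
print(f"{saved_fraction:.0%} of shuffle traffic avoided")
```

With output spread evenly over N hosts, the best placement keeps only about
1/N of the reducer's input local, so the rest still crosses the network.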
