Hi,

In Hadoop MapReduce, when I set the number of reduce tasks to run:
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

I've noticed that, during the execution of a MapReduce example, the reduce tasks make 9 requests to the MapOutputServlet on the TaskTracker. The value 9 comes from the 3 reduce tasks times the 3 splits that have map output. The purpose of the MapOutputServlet is to serve map output data to a reduce task.

Since the merge result from my example (the word-count example, by the way) doesn't contain duplicated data, where is the duplicated data discarded?

- Is it in the MapOutputServlet, which detects that a split was already requested?
- Is it in the reduce task, after retrieving data from the MapOutputServlet and before the merge phase?
- Is it during the merge phase?

Thanks for the help,

-- Pedro
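P.S. To make the arithmetic concrete, here is a minimal sketch of where the 9 comes from, assuming each reduce task fetches the output of every map split exactly once (the function name is just for illustration):

```python
# Each reduce task fetches the map output of every map split that
# produced output, so the total number of MapOutputServlet requests
# is (number of reduce tasks) * (number of map splits with output).
def expected_fetches(num_reduce_tasks, num_map_splits):
    return num_reduce_tasks * num_map_splits

# With mapred.reduce.tasks = 3 and 3 splits with map output:
print(expected_fetches(3, 3))  # the 9 requests I observed
```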