Hi,

In Hadoop MapReduce, when I set the number of reduce tasks to run:
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

I've noticed that, during the execution of a MapReduce example, the reduce tasks make 9 requests to the MapOutputServlet on the TaskTracker. The value 9 comes from the 3 reduce tasks times the 3 splits that have map output. The purpose of the MapOutputServlet is to serve map output data to a reduce task.

Since the merge result from my example (the word-count example, by the way) doesn't contain duplicated data, where is the duplicated data discarded?

- Is it in the MapOutputServlet, which detects that a split was already requested?
- Is it in the reduce task, after retrieving data from the MapOutputServlet and before the merge phase?
- Is it during the merge phase?

Thanks for the help,

-- Pedro
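P.S. To make the arithmetic concrete, here is a minimal sketch of where the 9 comes from, assuming each reduce task fetches the output of every map split exactly once (the function name is just for illustration):

```python
# Each reduce task fetches the map output of every map split that
# produced output, so the total number of MapOutputServlet requests
# is (number of reduce tasks) * (number of map splits with output).
def expected_fetches(num_reduce_tasks, num_map_splits):
    return num_reduce_tasks * num_map_splits

# With mapred.reduce.tasks = 3 and 3 splits with map output:
print(expected_fetches(3, 3))  # the 9 requests I observed
```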