On Sat, 2009-03-07 at 23:03 +0300, Mithila Nagendra wrote:
> Hey all
> I'm using Hadoop version 0.18.3, and was wondering whether the reduce phase
> starts only after the mapping is completed. Is it required that the Map
> phase is 100% done, or can it be programmed in such a way that the reduce
> starts earlier?

As I understand it, the reducers have three phases:

 1) Copy the data from the mappers ("Shuffle")
 2) Sort the data on the reducer (by key)
 3) Run the sorted data through the reduce function you've defined.

<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reducer.html>
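To make the three phases above concrete, here is a toy sketch in plain Python. It is not Hadoop code and uses no Hadoop API: map_outputs, run_reducer and count are names made up for illustration, and the word-count data is invented. It only shows the order of copy, sort and reduce inside a single reducer.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical output of three finished mappers: lists of (key, value) pairs.
map_outputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1), ("fig", 1)],
    [("pear", 1), ("apple", 1)],
]

def run_reducer(map_outputs, reduce_fn):
    # 1) Copy ("shuffle"): gather this reducer's records from every mapper.
    copied = [pair for output in map_outputs for pair in output]
    # 2) Sort by key so that all values for one key sit next to each other.
    copied.sort(key=itemgetter(0))
    # 3) Reduce: call the user-defined function once per key.
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(copied, key=itemgetter(0))
    }

def count(key, values):
    # Stand-in for "the function you've defined".
    return sum(values)

result = run_reducer(map_outputs, count)
print(result)
```

A real reducer only receives the partition of keys assigned to it; the sketch skips partitioning to keep the three steps visible.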

The reducer tasks start as soon as they are able to (I believe), so the
copying and sorting can happen while some mappers are still running.

Stage (3) cannot run until stage (2) has completed, and stage (2) cannot
complete until all the mappers have finished, since any mapper that is
still running might emit more values for a key.
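The waiting rule can be sketched with another toy model, again in plain Python rather than anything from Hadoop. ToyReducer and its methods are hypothetical names. The assumption is that each finished mapper hands over its output in one piece: the reducer copies and pre-sorts each piece as it arrives, but refuses to run the reduce function until every mapper has reported.

```python
import heapq
from itertools import groupby
from operator import itemgetter

class ToyReducer:
    """Toy model only; none of these names are Hadoop classes."""

    def __init__(self, total_mappers):
        self.total_mappers = total_mappers
        self.sorted_runs = []  # one sorted run per finished mapper

    def on_mapper_finished(self, output):
        # Copying and pre-sorting can happen while other mappers still run.
        self.sorted_runs.append(sorted(output, key=itemgetter(0)))

    def ready(self):
        return len(self.sorted_runs) == self.total_mappers

    def reduce(self, reduce_fn):
        if not self.ready():
            # An unfinished mapper could still emit any key, so no key's
            # list of values is known to be complete yet.
            raise RuntimeError("waiting for remaining mappers")
        # Final merge of the pre-sorted runs, then one call per key.
        merged = heapq.merge(*self.sorted_runs, key=itemgetter(0))
        return {
            key: reduce_fn([value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))
        }

reducer = ToyReducer(total_mappers=2)
reducer.on_mapper_finished([("b", 1), ("a", 1)])
early = reducer.ready()        # only one of two mappers has reported
reducer.on_mapper_finished([("a", 1)])
totals = reducer.reduce(sum)
print(early, totals)
```

The point of the design is that only the final merge and the reduce calls are gated on the last mapper; everything before that overlaps with the map phase.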

In my experience this hasn't been a major issue (especially if there are
many times more mappers than machines), since the shuffle and sort stages
take significant time and effort anyway, and much of that work overlaps
with the mappers that are still running.


Tim Wintle
