Nan,

The 'phase' is implicitly understood from the 'progress' value reported by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
For example:

Reduce:
0-33% -> Shuffle
34-66% -> Sort (actually, just 'merge'; there is no sort in the reduce since all map-outputs are already sorted)
67-100% -> Reduce

From 0.23 onwards the Map has phases too:
0-90% -> Map
91-100% -> Final Sort/merge

Now, about starting reduces early - this is done to ensure the shuffle can proceed for completed maps while the rest of the maps run, thereby pipelining shuffle and map completion. There is a 'reduce slowstart' feature to control this - by default, reduces aren't started until 5% of maps are complete. Users can set this higher.

Arun

On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:

> Hi, all
>
> recently, I was hit by a question, "how is a hadoop job divided into 2
> phases?",
>
> In textbooks, we are told that the mapreduce jobs are divided into 2 phases,
> map and reduce, and for reduce, we further divided it into 3 stages,
> shuffle, sort, and reduce, but in hadoop codes, I never think about
> this question, I didn't see any variable members in JobInProgress class
> to indicate this information,
>
> and according to my understanding on the source code of hadoop, the reduce
> tasks are unnecessarily started until all mappers are finished, in
> contrast, we can see the reduce tasks are in shuffle stage while there are
> mappers which are still running,
> So how can I indicate the phase which the job is belonging to?
>
> Thanks
> --
> Nan Zhu
> School of Electronic, Information and Electrical Engineering,229
> Shanghai Jiao Tong University
> 800,Dongchuan Road,Shanghai,China
> E-Mail: zhunans...@gmail.com
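
To make the progress ranges and the slowstart threshold from the reply concrete, here is a rough standalone sketch in Python. It is not Hadoop code: the function names (reduce_phase, map_phase, reduces_may_start) are hypothetical, progress is assumed to be a fraction between 0.0 and 1.0, and the exact handling of the boundary values is an assumption. In Hadoop of that era the slowstart fraction is set through the mapred.reduce.slowstart.completed.maps property; the sketch only mimics the comparison.

```python
def _check_fraction(progress):
    # Assumption: progress is reported as a fraction in [0.0, 1.0].
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")


def reduce_phase(progress):
    """Infer a reduce task's phase from its progress, one third per phase."""
    _check_fraction(progress)
    if progress < 1.0 / 3:
        return "SHUFFLE"
    if progress < 2.0 / 3:
        return "SORT"  # really a merge of already-sorted map outputs
    return "REDUCE"


def map_phase(progress):
    """Infer a map task's phase (0.23 onwards) from its progress."""
    _check_fraction(progress)
    # The first 90% is the map itself; the rest is the final sort/merge.
    return "MAP" if progress <= 0.9 else "SORT"


def reduces_may_start(completed_maps, total_maps, slowstart=0.05):
    """Return True once the completed fraction of maps reaches slowstart."""
    return completed_maps >= slowstart * total_maps
```

For instance, reduce_phase(0.5) falls in the middle third and so reports the merge ('sort') phase, while reduces_may_start(4, 100) is still False under the default 5% threshold.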