Re: phases of Hadoop Jobs

Arun C Murthy Sun, 18 Sep 2011 21:18:03 -0700

Nan,

 The 'phase' is implied by the 'progress' value reported by the map/reduce 
tasks (see o.a.h.mapred.TaskStatus.Phase).

 For example:
 Reduce: 
 0-33% -> Shuffle
 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce since 
all map-outputs are sorted)
 67-100% -> Reduce

 From 0.23 onwards, the Map has phases too:
 0-90% -> Map
 91-100% -> Final Sort/merge
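
 As a rough sketch of the mapping above (in Python for brevity, not Hadoop's 
own Java): the function names below are hypothetical and are not Hadoop APIs. 
The labels only mirror the phase names used in this mail, and the cut-offs are 
the approximate ones quoted above, so treat the boundaries as illustrative.

```python
def _check(progress):
    # Progress is assumed to be a fraction in [0.0, 1.0], not a percentage.
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")


def reduce_phase(progress):
    """Guess the phase of a reduce task from its progress fraction."""
    _check(progress)
    if progress < 1.0 / 3.0:
        return "SHUFFLE"  # first third: fetching map outputs
    if progress < 2.0 / 3.0:
        return "SORT"     # second third: really a merge of sorted map outputs
    return "REDUCE"       # final third: running the reduce function


def map_phase(progress):
    """Guess the phase of a map task (0.23 onwards) from its progress fraction."""
    _check(progress)
    if progress <= 0.90:
        return "MAP"      # first 90%: running the map function
    return "SORT"         # last 10%: final sort/merge of the map output
```

 For instance, reduce_phase(0.50) returns "SORT" and map_phase(0.95) returns 
"SORT", while map_phase(0.50) returns "MAP".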

 Now, about starting reduces early - this is done so that the shuffle can 
proceed for completed maps while the rest of the maps run, thereby pipelining 
shuffle and map completion. There is a 'reduce slowstart' feature to control 
this - by default, reduces aren't started until 5% of maps are complete. Users 
can set this higher.
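
 The slowstart check can be sketched roughly as follows (again Python, not 
Hadoop code). The function name and the rounding-up of the threshold are 
assumptions made for illustration; only the 5% default comes from the 
description above. In the 0.20-era releases the setting is, to my knowledge, 
the mapred.reduce.slowstart.completed.maps property, but check the 
documentation for your version.

```python
import math


def reduces_may_start(completed_maps, total_maps, slowstart=0.05):
    """Return True once enough maps have finished for reduces to be launched.

    slowstart is the fraction of maps that must complete first (default 5%).
    Rounding the threshold up is an assumption of this sketch.
    """
    if total_maps < 0 or completed_maps < 0:
        raise ValueError("map counts must be non-negative")
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0.0 and 1.0")
    threshold = math.ceil(slowstart * total_maps)
    return completed_maps >= threshold
```

 With 100 maps and the default, reduces become eligible once 5 maps have 
completed. Setting slowstart to 1.0 holds reduces back until every map is 
done, which gives up the shuffle/map pipelining described above.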

Arun

On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:

> Hi, all
> 
> recently, I was hit by a question, "how is a hadoop job divided into 2
> phases?",
> 
> In textbooks, we are told that the mapreduce jobs are divided into 2 phases,
> map and reduce, and for reduce, we further divided it into 3 stages,
> shuffle, sort, and reduce, but in hadoop codes, I never think about
> this question, I didn't see any variable members in JobInProgress class
> to indicate this information,
> 
> and according to my understanding of the hadoop source code, the reduce
> tasks need not be started until all mappers are finished; in
> contrast, we can see reduce tasks in the shuffle stage while there are
> mappers which are still running.
> So how can I tell which phase the job is in?
> 
> Thanks
> -- 
> Nan Zhu
> School of Electronic, Information and Electrical Engineering,229
> Shanghai Jiao Tong University
> 800,Dongchuan Road,Shanghai,China
> E-Mail: zhunans...@gmail.com
