RE: phases of Hadoop Jobs

2011-09-19 Thread GOEKE, MATTHEW (AG/1000)
-ui has to offer, it doesn't take long to learn how to skim it and get a 10x more accurate reading on your job progress. Matt

Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi Chen, yes, it saves time to move map() output to the nodes where it will be needed for the reduce() input. After map() has processed the first blocks, it makes sense to copy that output to the reduce nodes. Imagine a very large map() output. If the shuffle were postponed until after all map nod

Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Or we can just separate the shuffle from the reduce stage and integrate it into the map stage. Then we can clearly differentiate the map stage (before the shuffle finishes) and the reduce stage (after the shuffle finishes).

Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Hi Kai, thank you for the reply. The reduce() will not start because the shuffle phase has not finished, and the shuffle phase will not finish until all mappers end. I am curious about the design purpose behind overlapping the map and reduce stages. Was this only for saving shuffling time? Or the

Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi Chen, the times when nodes run map and reduce task instances overlap, but map() and reduce() execution will not. Reduce nodes will start copying data from map nodes; that's the shuffle phase. And the map nodes are still running during that copy phase. My observation had been tha

Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Hi Arun, I have a question. Do you know the reason that Hadoop allows the map and reduce stages to overlap? Or does anyone else know about it? Thank you in advance. Chen

Re: phases of Hadoop Jobs

2011-09-18 Thread Nan Zhu
Hi Arun, thanks! As you explained, in Hadoop we cannot explicitly divide a job into two phases, map and reduce, but only for a reduce task can we judge which stage it's in (shuffle, sort, reduce) (with 0.23, we can also do it with mappers), right? Nan

Re: phases of Hadoop Jobs

2011-09-18 Thread Arun C Murthy
Agreed. At least, I believe the new web-ui for MRv2 is (or will be soon) more verbose about this.

Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi, these 0-33-66-100% phases are really confusing to beginners. We see that in our training classes. The output should be more verbose, such as breaking down the phases into separate progress numbers. Does that make sense?

Re: phases of Hadoop Jobs

2011-09-18 Thread Arun C Murthy
Nan, the 'phase' is implicitly understood from the 'progress' (value) made by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase). E.g. for Reduce: 0-33% -> Shuffle, 34-66% -> Sort (actually just 'merge'; there is no sort in the reduce since all map-outputs are sorted), 67-100% -> Reduce
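Arun's breakdown above can be sketched as a tiny helper. This is a hypothetical illustration in Python, not Hadoop's actual code (the real mapping lives in the Java class org.apache.hadoop.mapred.TaskStatus); it just shows how a single progress value encodes the reduce phase:

```python
def reduce_phase(progress):
    """Map a reduce task's overall progress (0.0-1.0) to its phase,
    following the 0-33 / 34-66 / 67-100% convention described above.
    Illustrative only; not Hadoop's actual implementation."""
    if progress < 1.0 / 3:
        return "SHUFFLE"
    elif progress < 2.0 / 3:
        return "SORT"  # really a merge: map outputs arrive pre-sorted
    else:
        return "REDUCE"

# Each third of the progress bar corresponds to one phase:
print(reduce_phase(0.10))  # SHUFFLE
print(reduce_phase(0.50))  # SORT
print(reduce_phase(0.90))  # REDUCE
```

This is also why the web-ui jumps in thirds: a reduce task sitting at 33% has finished copying but done no user reduce() work yet.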

Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Hi Nan, I have had the same question for a while. In some research papers, people like to make the reduce stage slow-start. In this way, the map stage and reduce stage are easy to differentiate. You can use the number of remaining unallocated map tasks to detect which stage your job is in. To le
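The slow start He Chen mentions is controlled by a job configuration property: in Hadoop of this era it is `mapred.reduce.slowstart.completed.maps` (renamed `mapreduce.job.reduce.slowstart.completedmaps` in later releases). A sketch of the setting, here forcing reducers to wait for every map:

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <!-- Fraction of map tasks that must complete before reduce tasks are
       scheduled. 1.0 delays reducers (and thus the shuffle) until all
       maps have finished, cleanly separating the two stages. -->
  <value>1.0</value>
</property>
```

With the default small fraction, reducers launch early so the shuffle overlaps the remaining maps, which is exactly the overlap being discussed in this thread.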

phases of Hadoop Jobs

2011-09-18 Thread Nan Zhu
Hi all, recently I was hit by a question: "how is a hadoop job divided into 2 phases?" In textbooks, we are told that MapReduce jobs are divided into 2 phases, map and reduce, and for reduce, we further divide it into 3 stages: shuffle, sort, and reduce. But in the Hadoop code, I never think