On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Your understanding of Sort is not right. The key concept of Sort is
> the TotalOrderPartitioner. Actually, before the map-reduce job runs, the
> client side samples the input data to estimate its distribution. The
> mapper does nothing; each reducer fetches its data according to the
> TotalOrderPartitioner. The data in each reducer is locally sorted, and
> the reducers themselves are ordered (r0 < r1 < r2 ...), so the overall
> result is sorted.
>

The sorting happens on the map side, actually, during the spill process.
The mapper itself is an identity function, but the map task code does
perform a sort (on a <partition, key> tuple) as originally described in
this thread. Reducers just do a merge of the mapper outputs.

-Todd

> On Thu, Jun 17, 2010 at 12:13 AM, 李钰 <car...@gmail.com> wrote:
> > Hi all,
> >
> > I'm doing some tuning of the sort benchmark of hadoop. To be more
> > specific, I'm running tests against the org.apache.hadoop.examples.Sort
> > class. Looking through the source code, I think the map tasks take
> > responsibility for sorting the input data, and the reduce tasks just
> > merge the map outputs and write them into HDFS. But here I've got a
> > question I can't understand: the time cost of the reduce phase of each
> > reduce task, that is, writing data into HDFS, differs from task to
> > task. Since the input data and operations of each reduce task are the
> > same, what would cause the execution times to differ? Is there
> > anything wrong with my understanding? Does anybody have any experience
> > with this? Badly need your help, thanks.
> >
> > Best Regards,
> > Carp
>
> --
> Best Regards
>
> Jeff Zhang

--
Todd Lipcon
Software Engineer, Cloudera
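Putting the two replies together, the whole pipeline (client-side sampling, map-side sort per partition, reduce-side merge) can be simulated in a few lines. This is an illustrative Python sketch, not Hadoop code; `pick_cut_points`, `map_side`, and `reduce_side` are made-up names standing in for what InputSampler/TotalOrderPartitioner, the map task's spill sort, and the reducer's merge do respectively.

```python
import heapq
import random
from bisect import bisect_right

def pick_cut_points(records, num_reducers, sample_size=100):
    # Client-side sampling (roughly what InputSampler does for the
    # TotalOrderPartitioner): estimate the key distribution and choose
    # num_reducers - 1 boundary keys.
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    return [sample[(i + 1) * len(sample) // num_reducers]
            for i in range(num_reducers - 1)]

def map_side(records, cuts, num_reducers):
    # The mapper is an identity function; the map *task* sorts its spill
    # keyed on (partition, key), producing one sorted run per partition.
    runs = [[] for _ in range(num_reducers)]
    for key in records:
        runs[bisect_right(cuts, key)].append(key)
    return [sorted(run) for run in runs]

def reduce_side(runs_for_partition):
    # The reducer only merges the already-sorted map outputs.
    return list(heapq.merge(*runs_for_partition))

# Simulate 3 map tasks feeding 4 reducers.
data = [random.randrange(10**6) for _ in range(9_000)]
splits = [data[0:3000], data[3000:6000], data[6000:9000]]
cuts = pick_cut_points(data, num_reducers=4)
map_outputs = [map_side(split, cuts, 4) for split in splits]

result = []
for p in range(4):  # reducer p fetches partition p from every map task
    result.extend(reduce_side([mo[p] for mo in map_outputs]))

# Each reducer's output is locally sorted, and partition p's keys all
# precede partition p+1's, so plain concatenation is globally sorted.
assert result == sorted(data)
```

The key design point the thread is debating: the comparison work happens in the map tasks (step 2), while the reducers do only a merge (step 3); the sampling (step 1) is what makes the concatenated reducer outputs globally ordered.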