On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Your understanding of Sort is not right. The key concept of Sort is
> the TotalOrderPartitioner. Actually, before the map-reduce job runs, the
> client side samples the input data to estimate its distribution. The
> mapper does nothing; each reducer fetches its data according to the
> TotalOrderPartitioner. The data in each reducer is locally sorted, and
> the reducers themselves are ordered (r0 < r1 < r2 ...), so the overall
> result is sorted.
>

The sorting happens on the map side, actually, during the spill process.
The mapper itself is an identity function, but the map task code does
perform a sort (on a <partition, key> tuple) as originally described in
this thread. Reducers just do a merge of the mapper outputs.

-Todd

> On Thu, Jun 17, 2010 at 12:13 AM, 李钰 <car...@gmail.com> wrote:
> > Hi all,
> >
> > I'm doing some tuning of the sort benchmark of hadoop. To be more
> > specific, I'm running tests against the org.apache.hadoop.examples.Sort
> > class. Looking through the source code, I think the map tasks take
> > responsibility for sorting the input data, and the reduce tasks just
> > merge the map outputs and write them into HDFS. But here I've got a
> > question I can't understand: the time cost of the reduce phase of each
> > reduce task, that is, writing data into HDFS, differs from task to
> > task. Since the input data and operations of each reduce task are the
> > same, what would cause the execution times to differ? Is there
> > anything wrong with my understanding? Does anybody have any experience
> > with this? Badly need your help, thanks.
> >
> > Best Regards,
> > Carp
>
> --
> Best Regards
>
> Jeff Zhang

--
Todd Lipcon
Software Engineer, Cloudera
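Putting the two replies together, the whole pipeline (client-side sampling, map-side sort per partition, reduce-side merge) can be simulated in a few lines. This is an illustrative Python sketch, not Hadoop code; `pick_cut_points`, `map_side`, and `reduce_side` are made-up names standing in for what InputSampler/TotalOrderPartitioner, the map task's spill sort, and the reducer's merge do respectively.

```python
import heapq
import random
from bisect import bisect_right

def pick_cut_points(records, num_reducers, sample_size=100):
    # Client-side sampling (roughly what InputSampler does for the
    # TotalOrderPartitioner): estimate the key distribution and choose
    # num_reducers - 1 boundary keys.
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    return [sample[(i + 1) * len(sample) // num_reducers]
            for i in range(num_reducers - 1)]

def map_side(records, cuts, num_reducers):
    # The mapper is an identity function; the map *task* sorts its spill
    # keyed on (partition, key), producing one sorted run per partition.
    runs = [[] for _ in range(num_reducers)]
    for key in records:
        runs[bisect_right(cuts, key)].append(key)
    return [sorted(run) for run in runs]

def reduce_side(runs_for_partition):
    # The reducer only merges the already-sorted map outputs.
    return list(heapq.merge(*runs_for_partition))

# Simulate 3 map tasks feeding 4 reducers.
data = [random.randrange(10**6) for _ in range(9_000)]
splits = [data[0:3000], data[3000:6000], data[6000:9000]]
cuts = pick_cut_points(data, num_reducers=4)
map_outputs = [map_side(split, cuts, 4) for split in splits]

result = []
for p in range(4):  # reducer p fetches partition p from every map task
    result.extend(reduce_side([mo[p] for mo in map_outputs]))

# Each reducer's output is locally sorted, and partition p's keys all
# precede partition p+1's, so plain concatenation is globally sorted.
assert result == sorted(data)
```

The key design point the thread is debating: the comparison work happens in the map tasks (step 2), while the reducers do only a merge (step 3); the sampling (step 1) is what makes the concatenated reducer outputs globally ordered.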