[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184107#comment-13184107
 ] 

anty.rao commented on MAPREDUCE-3397:
-------------------------------------

The submitted path do not yet eliminated the shuffle phase barrier?
                
> Support no sort dataflow in map output and reduce merge phrase
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-3397
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3397
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: task
>    Affects Versions: 0.20.205.0
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: MAPREDUCE-3397-nosort.v1.patch
>
>
> In our experience, many data aggregation style queries/jobs don't need to 
> sort the intermediate data. In fact reducer side can use hashmap or even 
> array to do application level aggregations. For example, consider computing 
> CTR using display log & click log in sponsored search. Map side just emit 
> (adv_id, clk_cnt, dis_cnt), reduce side aggregate clk_cnt and dis_cnt for 
> every adv_id, cause adv_id is integer, we can partition adv_id by range:
> ** reduce0: 0-100000
> ** reduce1: 100000-200000
> ** ...
> ** reduceM: xxx-max adv-id
> Then the reducer can use an array(for example: int [1000000][2]) to store the 
> aggregated clk_cnt & dis_cnt, and we don't need the framework to sort 
> intermediate data anymore.
> By supporting no sort, we can gain a lot of performance improvements:
> # Eliminate map side sort & merge. 
>   KV paris need to sort by partition first, but this can be done using a 
> liner time counting sort, which is much faster than quick sort.
>   Just merge spill segments one by one, doesn't need to use heap merge.
> # Eliminate shuffle phrase barrier, reducer can start to processing data 
> before all map output data are copied & merged.
> For most cases, memory won't be a problem, cause keys are divided to many 
> partitions, each reducers only process a small subset of the global key set. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to