[ https://issues.apache.org/jira/browse/MAPREDUCE-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184107#comment-13184107 ]
anty.rao commented on MAPREDUCE-3397: ------------------------------------- The submitted path do not yet eliminated the shuffle phase barrier? > Support no sort dataflow in map output and reduce merge phrase > -------------------------------------------------------------- > > Key: MAPREDUCE-3397 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-3397 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: task > Affects Versions: 0.20.205.0 > Reporter: Binglin Chang > Assignee: Binglin Chang > Attachments: MAPREDUCE-3397-nosort.v1.patch > > > In our experience, many data aggregation style queries/jobs don't need to > sort the intermediate data. In fact reducer side can use hashmap or even > array to do application level aggregations. For example, consider computing > CTR using display log & click log in sponsored search. Map side just emit > (adv_id, clk_cnt, dis_cnt), reduce side aggregate clk_cnt and dis_cnt for > every adv_id, cause adv_id is integer, we can partition adv_id by range: > ** reduce0: 0-100000 > ** reduce1: 100000-200000 > ** ... > ** reduceM: xxx-max adv-id > Then the reducer can use an array(for example: int [1000000][2]) to store the > aggregated clk_cnt & dis_cnt, and we don't need the framework to sort > intermediate data anymore. > By supporting no sort, we can gain a lot of performance improvements: > # Eliminate map side sort & merge. > KV paris need to sort by partition first, but this can be done using a > liner time counting sort, which is much faster than quick sort. > Just merge spill segments one by one, doesn't need to use heap merge. > # Eliminate shuffle phrase barrier, reducer can start to processing data > before all map output data are copied & merged. > For most cases, memory won't be a problem, cause keys are divided to many > partitions, each reducers only process a small subset of the global key set. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira