[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-2354:
----------------------------------------
    Fix Version/s:     (was: 0.24.0)

> Shuffle should be optimized
> ---------------------------
>
>                 Key: MAPREDUCE-2354
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2354
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.1
>            Reporter: MengWang
>              Labels: mapreduce, shuffle
>
> Our study shows that shuffle is a performance bottleneck of mapreduce 
> computing. There are some problems of shuffle:
> (1)Shuffle and reduce are tightly-coupled, usually shuffle phase doesn't 
> consume too much memory and CPU, so theoretically, reducetasks's slot can be 
> used for other computing tasks when copying data from maps. This method will 
> enhance cluster utilization. Furthermore, should shuffle be separated from 
> reduce? Then shuffle will not use reduce's slot,we need't distinguish between 
> map slots and reduce slots at all.
> (2)For large jobs, shuffle will use too many network connections, Data 
> transmitted by each network connection is very little, which is inefficient. 
> From 0.21.0 one connection can transfer several map outputs, but i think this 
> is not enough. Maybe we can use a per node shuffle client progress(like 
> tasktracker) to shuffle data for all reduce tasks on this node, then we can 
> shuffle more data trough one connection.
> (3)Too many concurrent connections will cause shuffle server do massive 
> random IO, which is inefficient. Maybe we can aggregate http request(like 
> delay scheduler), then random IO will be sequential.
> (4)How to manage memory used by shuffle efficiently. We use buddy memory 
> allocation, which will waste a considerable amount of memory.
> (5)If shuffle separated from reduce, then we must figure out how to do reduce 
> locality?
> (6)Can we store map outputs in a Storage system(like hdfs)?
> (7)Can shuffle be a general data transfer service, which not only for 
> map/reduce paradigm?
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to