[jira] [Updated] (SPARK-14293) Improve shuffle load balancing and minmize network data transmission

Cheng Pei (JIRA) Thu, 31 Mar 2016 05:43:53 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-14293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Cheng Pei updated SPARK-14293:
------------------------------
    Description: 
Currently Spark provides the mechanism to set preferred location for reduce 
task. When the fraction of total map output at a location is equal or greater 
than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a 
preferred location. But this does not consider load balancing and network 
transmission. Based on the map output sizes and locations tracked by 
MapOutputTracker, we can obtain a better load balancing.
 
This patch proposes a strategy to set preferred locations for each reduce task, 
which could firstly keep each executor process almost the same amount of 
intermediate data and secondly minimize the network data transmission. This can 
benefit some conditions:
1．      REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the 
largest output. If there exists data skew in the map outputs, it could cause 
some executors that have large of map outputs become busy. Our method could 
avoid this case and minimize the network data transmission. 
2．      When there are large of reduce tasks in the job, it helps each executor 
processes almost the same data and keeps load balancing.


  was:
Currently Spark provides the mechanism to set preferred location for reduce 
task. When the fraction of total map output at a location is equal or greater 
than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a 
preferred location. But this does not consider load balancing and network 
transmission. Based on the map output sizes and locations tracked by 
MapOutputTracker, we can obtain a better load balancing.
 
This patch proposes a strategy to set preferred locations for each reduce task, 
which could firstly keep each executor process almost the same amount of 
intermediate data and secondly minimize the network data transmission. This can 
benefit some conditions:
1．      REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the 
largest output. If there exists data skew in the map outputs. It could cause 
some executors that have large of map outputs become busy. Our method could 
avoid this case and minimize the network data transmission. 
2．      When there are large of reduce tasks in the job, it helps each executor 
processes almost the same data and keeps load balancing.



> Improve shuffle load balancing and minmize network data transmission
> --------------------------------------------------------------------
>
>                 Key: SPARK-14293
>                 URL: https://issues.apache.org/jira/browse/SPARK-14293
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Cheng Pei
>              Labels: performance
>
> Currently Spark provides the mechanism to set preferred location for reduce 
> task. When the fraction of total map output at a location is equal or greater 
> than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a 
> preferred location. But this does not consider load balancing and network 
> transmission. Based on the map output sizes and locations tracked by 
> MapOutputTracker, we can obtain a better load balancing.
>  
> This patch proposes a strategy to set preferred locations for each reduce 
> task, which could firstly keep each executor process almost the same amount 
> of intermediate data and secondly minimize the network data transmission. 
> This can benefit some conditions:
> 1．    REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the 
> largest output. If there exists data skew in the map outputs, it could cause 
> some executors that have large of map outputs become busy. Our method could 
> avoid this case and minimize the network data transmission. 
> 2．    When there are large of reduce tasks in the job, it helps each executor 
> processes almost the same data and keeps load balancing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-14293) Improve shuffle load balancing and minmize network data transmission

Reply via email to