[ https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073597#comment-14073597 ]
Kay Ousterhout edited comment on SPARK-2387 at 7/24/14 8:23 PM:
----------------------------------------------------------------

Have you done experiments to understand how much this improves performance? With Hadoop MapReduce, I've seen this behavior significantly worsen performance for a few reasons.

Ultimately, the problem is that the "reduce" stage (the one that depends on the shuffle map stage) can't finish until all of the map tasks finish. So, if there is a long map straggler, the reduce tasks can't finish anyway -- and now many more slots are hogged by the early reducers, preventing other jobs from making progress. Even worse, if reduce tasks are launched before all map tasks have been launched, the early reducers keep map tasks from being launched, but can end up stalled waiting for input from mappers that haven't completed yet. (Although it looks like your pull request is done in a way that tries to avoid the latter problem.)

As a result of the above issues, I've heard that many places (I think Facebook, for example) disable this behavior in Hadoop. So, we should make sure this will not hurt performance (and will significantly help!) before adding a lot of complexity to Spark in order to implement it.

> Remove the stage barrier for better resource utilization
> --------------------------------------------------------
>
>                 Key: SPARK-2387
>                 URL: https://issues.apache.org/jira/browse/SPARK-2387
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Rui Li
>
> DAGScheduler divides a Spark job into multiple stages according to RDD
> dependencies. Whenever there's a shuffle dependency, DAGScheduler creates a
> shuffle map stage on the map side, and another stage depending on it.
> Currently, the downstream stage cannot start until all of the stages it
> depends on have finished. This barrier between stages leads to idle slots
> while waiting for the last few upstream tasks to finish, thus wasting
> cluster resources. Therefore we propose to remove the barrier and pre-start
> the reduce stage once there are free slots. This can achieve better resource
> utilization and improve the overall job performance, especially when there
> are lots of executors granted to the application.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
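The trade-off debated above can be sketched with a toy simulation. This is not Spark's actual scheduler: the function name, the slot counts, and all task durations are invented for illustration. The model captures both sides of the argument: a pre-started reducer can overlap its shuffle fetch with a straggling map task (the utilization win the issue proposes), but it holds a slot idle once its fetch has outrun the available map output (the slot-hogging cost the comment warns about). In Hadoop, this behavior is governed by the `mapreduce.job.reduce.slowstart.completedmaps` property; setting it to 1.0 is how sites effectively disable early reducer launch.

```python
import heapq

def simulate(slots, map_durations, n_reducers, fetch, compute, prestart):
    """Toy discrete-event model of one shuffle stage boundary.

    Maps are placed greedily (longest first) and always take priority over
    reducers, mirroring the comment's point that reducers must not starve
    map tasks. A reducer holds a slot from launch; it may fetch map output
    concurrently with still-running maps, but cannot begin its reduce
    computation until every map has finished.

    Returns (makespan, reducer_slot_seconds_spent_waiting).
    """
    free_at = [0.0] * slots          # min-heap of times each slot frees up
    heapq.heapify(free_at)

    maps_done = 0.0
    for d in sorted(map_durations, reverse=True):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
        maps_done = max(maps_done, start + d)

    waited = 0.0
    for _ in range(n_reducers):
        start = heapq.heappop(free_at)
        if not prestart:
            start = max(start, maps_done)       # stage barrier in place
        # Reduce computation starts only after fetch AND all map output.
        end = max(start + fetch, maps_done) + compute
        waited += max(0.0, maps_done - (start + fetch))  # slot held idle
        heapq.heappush(free_at, end)

    return max(maps_done, max(free_at)), waited
```

With 4 slots, six 1-second maps plus one 10-second straggler, and three reducers (2s fetch, 3s compute), pre-starting finishes the job at t=13 instead of t=15 because the fetches overlap the straggler, but it spends 18 slot-seconds of reducer time waiting on the straggler's output; with a fourth reducer (one per slot) the makespan advantage disappears entirely, which is the scenario the comment cautions against.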