[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123411#comment-16123411 ]
Sean Owen commented on SPARK-21656:
-----------------------------------

I'm referring to the same issue you cite repeatedly, including:
https://github.com/apache/spark/pull/18874#issuecomment-321313616
https://issues.apache.org/jira/browse/SPARK-21656?focusedCommentId=16117200&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16117200

In a scenario like the driver being busy in long GC pauses, the allocation manager doesn't keep up with the fact that the executors are non-idle, and it removes them. Its conclusion is incorrect, and that's what we're trying to fix. All the more because going to 0 executors stops the stage. Right? I thought we had finally made it clear that this was the problem being fixed.

Now you're just describing a job that needs a lower locality timeout. (Or else you're describing a different problem with a different solution, as in https://github.com/apache/spark/pull/18874#issuecomment-321625808 -- why do they take so much longer than 3s to fall back to other executors?) That scenario is not a reason to make this change.

[~tgraves] please read https://github.com/apache/spark/pull/18874#issuecomment-321683515 . You're saying there's no counterpart scenario that is actually harmed a bit by this change, and I think there is. We need to get on the same page.
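For reference, a minimal Scala sketch of the two settings this thread keeps circling, shown at their documented defaults (illustrative only, not a recommendation):

    import org.apache.spark.SparkConf

    // The two timeouts at issue in this discussion, set explicitly to
    // their documented default values.
    val conf = new SparkConf()
      // Dynamic allocation must be on for the idle timeout to apply.
      .set("spark.dynamicAllocation.enabled", "true")
      // How long the scheduler waits for a node-local slot before falling
      // back to rack-local and then to any executor (the "3s" above).
      .set("spark.locality.wait", "3s")
      // How long an executor may go without a scheduled task before dynamic
      // allocation releases it (the idle timeout in the description below).
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")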
> spark dynamic allocation should not idle timeout executors when there are
> enough tasks to run on them
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21656
>                 URL: https://issues.apache.org/jira/browse/SPARK-21656
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation, Spark starts by acquiring the number of
> executors it needs to run all the tasks in parallel (or the configured
> maximum) for that stage. After it gets that number, it will never reacquire
> more unless an executor dies, is explicitly killed by YARN, or the job moves
> to the next stage. The dynamic allocation manager has the concept of an idle
> timeout: if no task has been scheduled on an executor for a configurable
> amount of time (60 seconds by default), that executor is released. Note that
> when it releases an executor due to the idle timeout, it never goes back to
> check whether it should reacquire more.
> This is a problem for multiple reasons:
> 1. Unexpected things can happen in the system that cause delays, and Spark
> should be resilient to them. If the driver is GC'ing, there are network
> delays, etc., we could idle-timeout executors even though there are tasks to
> run on them; it's just that the scheduler hasn't had time to start those
> tasks. Note that in the worst case this allows the number of executors to go
> to 0, and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler tries
> to achieve locality; dynamic allocation doesn't know about this, and if it
> lets the executors go, it prevents the scheduler from doing what it was
> designed to do. For example, the scheduler first tries to schedule
> node-local, and during this time it can skip scheduling on some executors.
> After a while, though, the scheduler falls back from node-local to
> rack-local scheduling, and then eventually to any node. So while the
> scheduler is doing node-local scheduling, the other executors can
> idle-timeout. This means that by the time the scheduler does fall back to
> rack or any locality, where it would have used those executors, we have
> already let them go, and it can't schedule all the tasks it otherwise could,
> which can have a huge negative impact on job run time.
>
> In both of these cases, when the executors idle-timeout we never go back to
> check whether we need more executors (until the next stage starts). In the
> worst case you end up with 0 executors and deadlock, but generally this
> shows itself as going down to very few executors when you could have tens of
> thousands of tasks to run on them, which makes the job take far more time
> (in my case I've seen a job that should take minutes take hours because it
> was left only a few executors).
> We should handle these situations in Spark. The most straightforward
> approach would be to not allow executors to idle-timeout when there are
> tasks that could run on them; a sketch of this guard follows below. This
> would let the scheduler do its job with locality scheduling. It also fixes
> number 1 above, because you can never go into a deadlock: enough executors
> are kept to run all the tasks.
> There are other approaches to fix this, like explicitly preventing it from
> going to 0 executors; that prevents a deadlock but can still slow the job
> down greatly. We could also change it at some point to simply re-check
> whether we should get more executors, but this adds extra logic (we would
> have to decide when to check), and it's also just overhead: letting
> executors go and then reacquiring them slows the job down because the
> executors aren't immediately there for the scheduler to place tasks on.
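For illustration only, a minimal Scala sketch of the guard the description proposes. The names here (ExecutorInfo, shouldRemove, tasksPerExecutor) are hypothetical, not Spark's actual ExecutorAllocationManager internals:

    // Hypothetical sketch: only let an executor idle out when the remaining
    // executors could still absorb all pending tasks.
    case class ExecutorInfo(id: String, idleSinceMs: Long)

    def shouldRemove(
        executor: ExecutorInfo,
        nowMs: Long,
        idleTimeoutMs: Long,
        pendingTasks: Int,
        liveExecutors: Int,
        tasksPerExecutor: Int): Boolean = {
      val timedOut = nowMs - executor.idleSinceMs >= idleTimeoutMs
      // Proposed change: even after the timeout fires, keep the executor if
      // the pending tasks would not fit on the other executors; this also
      // makes it impossible to drop to 0 executors while work remains.
      val stillNeeded = pendingTasks > (liveExecutors - 1) * tasksPerExecutor
      timedOut && !stillNeeded
    }

Under a guard like this, the executor count can only shrink while the pending tasks still fit on the remaining executors, which covers both the deadlock case and the locality case described above.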