[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512691#comment-16512691 ]
Thomas Graves commented on SPARK-22148:
---------------------------------------

OK, just update if you start working on it. Thanks.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22148
>                 URL: https://issues.apache.org/jira/browse/SPARK-22148
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Juan Rodríguez Hortalá
>            Priority: Major
>         Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and
> the whole Spark job with `task X (partition Y) cannot run anywhere due to
> node and executor blacklist. Blacklisting behavior can be configured via
> spark.blacklist.*.` when all the available executors are blacklisted for a
> pending Task or TaskSet. This makes sense for static allocation, where the
> set of executors is fixed for the duration of the application, but it can
> lead to unnecessary job failures when dynamic allocation is enabled. For
> example, in a Spark application with a single job at a time, when a node
> fails at the end of a stage attempt, all other executors will complete their
> tasks, but the tasks running on the executors of the failing node will be
> pending. Spark will keep waiting for those tasks for 2 minutes by default
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it
> will blacklist those executors for that stage. By that point, the other
> executors will have been released after being idle for 1 minute by default
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't
> started yet and so there are no more tasks available (assuming the default of
> spark.speculation = false).
> So Spark will fail because the only executors available are blacklisted for
> that stage.
> An alternative is to request more executors from the cluster manager in this
> situation. The request could be retried a configurable number of times, with
> a configurable wait time between attempts, so that if the cluster manager
> fails to provide a suitable executor the job is aborted as in the previous
> case.
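
For illustration only, the retry idea in the description could be sketched roughly as below. This is not the code in SPARK-22148_WIP.diff and not how TaskSetManager is implemented: `BlacklistRetrySketch`, `RetryPolicy`, `resolveCompletelyBlacklisted` and all of their parameters are hypothetical names made up for this sketch, and the cluster manager is reduced to two callbacks so the snippet runs without Spark on the classpath.

```scala
import scala.annotation.tailrec

// Hypothetical sketch: none of these names exist in Spark.
object BlacklistRetrySketch {
  sealed trait Outcome
  case object Schedulable extends Outcome // some executor can run the task set
  case object Aborted extends Outcome     // give up, as the current code does

  // The two "configurable" values from the proposal (names are assumptions).
  final case class RetryPolicy(maxRequestAttempts: Int, waitBetweenAttemptsMs: Long)

  /**
   * Decide what to do when a task set may be completely blacklisted.
   *
   * @param usableExecutors executors that are alive and not blacklisted for the task set
   * @param requestExecutor asks the (stubbed) cluster manager for one more executor
   * @param sleep           wait function, injectable so tests do not block
   */
  def resolveCompletelyBlacklisted(
      dynamicAllocationEnabled: Boolean,
      policy: RetryPolicy,
      usableExecutors: () => Set[String],
      requestExecutor: () => Unit,
      sleep: Long => Unit = ms => Thread.sleep(ms)): Outcome = {

    @tailrec
    def loop(attemptsLeft: Int): Outcome =
      if (usableExecutors().nonEmpty) Schedulable
      // Static allocation, or retries exhausted: abort like today.
      else if (!dynamicAllocationEnabled || attemptsLeft == 0) Aborted
      else {
        requestExecutor()
        sleep(policy.waitBetweenAttemptsMs)
        loop(attemptsLeft - 1)
      }

    loop(policy.maxRequestAttempts)
  }
}
```

With static allocation the function aborts immediately, which matches the behavior described above. With dynamic allocation it aborts only after `maxRequestAttempts` requests have failed to produce an executor that is not blacklisted.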