Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22288#discussion_r227067905

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -415,9 +420,55 @@ private[spark] class TaskSchedulerImpl(
             launchedAnyTask |= launchedTaskAtCurrentMaxLocality
           } while (launchedTaskAtCurrentMaxLocality)
         }
    +
         if (!launchedAnyTask) {
    -      taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
    +      taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors) match {
    +        case Some(taskIndex) => // Returns the taskIndex which was unschedulable
    +
    +          // If the taskSet is unschedulable we try to find an existing idle blacklisted
    +          // executor. If we cannot find one, we abort immediately. Else we kill the idle
    --- End diff --

I'm a little worried that the idle condition will be too strict in some scenarios: for instance, if there is a large backlog of tasks from another taskset, or if, whatever the error is, the tasks take a while to fail (e.g., you've really got a bad executor, but it's not apparent until after network timeouts or something). That could happen if you're doing a big join and, while preparing the input on the map side, one side has just one straggler left but the other side still has a big backlog of tasks. Or in a jobserver-style situation, where there are always other tasksets coming in.

That said, I don't have any better ideas at the moment, and I still think this is an improvement.
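To make the concern concrete, here is a minimal, hypothetical sketch (not Spark's actual classes or method names; `ExecutorInfo` and `chooseExecutorToKill` are invented for illustration) of the decision the diff describes: when a taskset is completely blacklisted, kill an idle blacklisted executor if one exists, otherwise abort. The strictness issue is in the `runningTasks == 0` check, since executors busy with another taskset's backlog never qualify as idle.

```scala
// Hypothetical simplified model of the "kill an idle blacklisted executor
// or abort" decision from the diff. Not Spark code.
object UnschedulableTaskSetSketch {
  final case class ExecutorInfo(id: String, runningTasks: Int)

  // Returns the executor to kill, or None, meaning the taskset must abort.
  // "Idle" here means zero running tasks; the review comment worries this is
  // too strict when a backlog from another taskset keeps executors busy.
  def chooseExecutorToKill(blacklisted: Seq[ExecutorInfo]): Option[String] =
    blacklisted.find(_.runningTasks == 0).map(_.id)

  def main(args: Array[String]): Unit = {
    val execs = Seq(ExecutorInfo("exec-1", 3), ExecutorInfo("exec-2", 0))
    // exec-2 is idle, so it is chosen to be killed.
    println(chooseExecutorToKill(execs))
    // No idle executor: the taskset would abort even though exec-1's tasks
    // (possibly from another taskset) might finish shortly.
    println(chooseExecutorToKill(Seq(ExecutorInfo("exec-1", 3))))
  }
}
```

Under this model, a straggler-heavy join or a jobserver workload keeps `runningTasks` above zero on every blacklisted executor, so the taskset aborts rather than reclaiming an executor.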