Github user GraceH commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7888#discussion_r44493155

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -509,6 +511,13 @@ private[spark] class ExecutorAllocationManager(
   private def onExecutorBusy(executorId: String): Unit = synchronized {
     logDebug(s"Clearing idle timer for $executorId because it is now running a task")
     removeTimes.remove(executorId)
+
+    // Executor is added to remove by misjudgment due to async listener making it as idle).
+    // see SPARK-9552
+    if (executorsPendingToRemove.contains(executorId)) {
--- End diff --

Here is the problem:

1. Executors -1, -2, and -3 are slated to be killed (say an idle timeout triggers that).
2. Under our new criteria, only executor-1 is eligible to kill; executor-2 and -3 are filtered out (`force = false`) and are never passed to `killExecutors`. Only executor-1 is sent a kill command and returns an acknowledgement.
3. When that acknowledgement comes back (it really covers only executor-1), the current code path adds all three executor IDs (-1, -2, -3) to `executorsPendingToRemove`, even though only executor-1 is actually being killed.

Dynamic allocation can get away with this assumption because it kills a single executor at a time. In the multiple-executor case, however, there is no way to tell the executors that were actually killed apart from the ones that were merely idle. Otherwise, we would need to change the APIs to return the list of executors that were really killed.
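To illustrate the last point, here is a minimal, hypothetical sketch (not the actual Spark API, where `killExecutors` only returns a success flag at this point): if the call returned the IDs it actually killed, the caller could update `executorsPendingToRemove` precisely instead of optimistically adding every requested ID. The `eligible` set and `KillExecutorsSketch` object are illustrative stand-ins, not real Spark code.

```scala
import scala.collection.mutable

object KillExecutorsSketch {
  // Hypothetical backend state: only executor-1 is eligible to kill
  // (executor-2 and -3 are filtered out when force = false).
  private val eligible = Set("executor-1")

  private val executorsPendingToRemove = mutable.Set[String]()

  // Hypothetical signature: return the executors that were actually killed,
  // instead of a bare Boolean acknowledgement.
  def killExecutors(executorIds: Seq[String]): Seq[String] = {
    val killed = executorIds.filter(eligible.contains)
    // ... send kill commands for `killed` only ...
    killed
  }

  def main(args: Array[String]): Unit = {
    val requested = Seq("executor-1", "executor-2", "executor-3")
    val actuallyKilled = killExecutors(requested)
    // Only the acknowledged kills are marked pending removal, so
    // executor-2 and -3 can still run tasks or time out normally.
    executorsPendingToRemove ++= actuallyKilled
    println(executorsPendingToRemove) // Set(executor-1)
  }
}
```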