Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/16620

@squito and @jinxing64 You're right -- with the existing code, if a task from an old attempt succeeded *and* didn't run on an executor where things already failed, the DAGScheduler will count the result (just realizing this based on [this if-statement](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1189)).

That being said, I think this behavior is broken, because it leads to inconsistent state between the DAGScheduler (which thinks the stage is done and submits the next ones) and the TaskSetManager for the most recent version of the stage (which is still waiting on the more recent version of tasks to complete). When the TaskSetManager for the most recent version of the stage finishes all of its tasks, it will tell the DAGScheduler -- again -- that the stage has finished, causing the DAGScheduler to update the finish time for the stage, send another (duplicate) SparkListenerStageCompleted message to the listeners (I think this will result in stages in the UI that appear to be finished yet still have running tasks), and re-update the outputs for the map stage. None of these things is obviously buggy (from a cursory look), but they violate a bunch of invariants in the scheduler, and I wouldn't be surprised if there were bugs lurking in this code path. Given the amount of debugging and reviewer time that gets dedicated to these subtle bugs, I'm in favor of the simpler solution that maintains consistent state between the DAGScheduler and the TaskSetManager.

@squito where has this behavior been argued against in the past? My understanding is that a bunch of the scheduler code is based on the assumption that once some tasks in a stage fail with a FetchFailure, we ignore future successes from that stage, because it makes the code much simpler (it's also hard, in some cases, to know whether the successes are "real" or delayed messages from machines that later failed). There was a bigger effort to fix that issue in [SPARK-14649](https://issues.apache.org/jira/browse/SPARK-14649), but there were a bunch of subtleties in getting that right, so for now effort on that has stopped. If someone wants to re-start the effort on that, it seems useful, but I think it should be de-coupled from fixing this bug.
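For illustration, here is a minimal sketch of the kind of epoch check the linked if-statement performs -- a paraphrase under my reading of it, not the actual DAGScheduler source, and the names (`failedEpoch`, `shouldCountResult`, `MapTaskCompletion`) are made up for the sketch:

```scala
// Sketch only: a successful map-task result is counted unless the executor
// it ran on has been marked failed at or after the epoch the task was
// launched in (in which case its output may be lost or bogus).
object EpochCheckSketch {

  // Hypothetical record of when each executor was last marked failed.
  val failedEpoch: Map[String, Long] = Map("exec-1" -> 3L)

  case class MapTaskCompletion(execId: String, taskEpoch: Long, partitionId: Int)

  /** True if a DAGScheduler-style check would accept this task's result. */
  def shouldCountResult(t: MapTaskCompletion): Boolean =
    failedEpoch.get(t.execId) match {
      // Executor failed after the task was launched: ignore the success.
      case Some(epoch) if t.taskEpoch <= epoch => false
      case _ => true
    }

  def main(args: Array[String]): Unit = {
    // Task from an old attempt that ran on a healthy executor: still counted,
    // even though a newer attempt of the stage may be running.
    println(shouldCountResult(MapTaskCompletion("exec-2", taskEpoch = 2L, partitionId = 0))) // true

    // Task whose executor failed after it was launched: ignored.
    println(shouldCountResult(MapTaskCompletion("exec-1", taskEpoch = 2L, partitionId = 1))) // false
  }
}
```

The first case is exactly the one discussed above: the result is accepted and the DAGScheduler marks the stage done, while the TaskSetManager for the newest attempt is still waiting on its own copy of that task.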