GitHub user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/16620
  
    @squito and @jinxing64 You're right -- with the existing code, if a task 
from an old attempt succeeded *and* didn't run on an executor where things 
had already failed, the DAGScheduler will count the result (I just realized 
this based on [this 
if-statement](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1189)).
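    To make the discussion concrete, here is a minimal sketch (these names and fields are illustrative, not Spark's actual code) of the kind of check described above: count a success unless it came from an executor with a recorded failure at or after the epoch the task was launched in.

```scala
// Hypothetical sketch -- StageState, failedEpochByExecutor, and
// shouldCountResult are illustrative names, not the real DAGScheduler API.
object StaleResultCheck {
  // Epoch at which each executor was last marked failed for this stage.
  final case class StageState(failedEpochByExecutor: Map[String, Long])

  // A success is counted only if the task was launched after any recorded
  // failure on its executor (or the executor has no recorded failure).
  def shouldCountResult(state: StageState, execId: String, taskEpoch: Long): Boolean =
    state.failedEpochByExecutor.get(execId) match {
      case Some(failedEpoch) => taskEpoch > failedEpoch
      case None              => true
    }
}
```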
    
    That being said, I think this behavior is broken, because it leads to 
inconsistent state between the DAGScheduler (which thinks the stage is done and 
submits the next ones) and the TaskSetManager for the most recent version of 
the stage (which is still waiting on the more recent version of the tasks to 
complete).  When the TaskSetManager for the most recent version of the stage 
finishes all of its tasks, it will tell the DAGScheduler -- again -- that the 
stage has finished.  This causes the DAGScheduler to update the finish time for 
the stage, send another (duplicate) SparkListenerStageCompleted message to the 
listeners (I think this will result in stages in the UI that appear to be 
finished yet still have running tasks), and re-update the outputs for the map 
stage.  None of these things are obviously buggy (from a cursory look), but 
they violate a bunch of invariants in the scheduler, and I wouldn't be 
surprised if there were bugs lurking in this code path.  Given the amount of 
debugging and reviewer time that gets dedicated to these subtle bugs, I'm in 
favor of the simpler solution that maintains consistent state between the 
DAGScheduler and the TaskSetManager.
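    The duplicate-completion sequence above can be sketched as follows (an illustrative toy model, not Spark's actual classes): a listener observes the same stage completed twice, with the finish time re-updated the second time.

```scala
// Toy model of the inconsistency: StageCompleted, Listener, and run are
// hypothetical names standing in for SparkListenerStageCompleted handling.
object DuplicateCompletionSketch {
  final case class StageCompleted(stageId: Int, finishTime: Long)

  class Listener {
    var events: List[StageCompleted] = Nil
    def onStageCompleted(e: StageCompleted): Unit = events = e :: events
  }

  def run(): Listener = {
    val listener = new Listener
    // 1. A success from an old stage attempt makes the scheduler mark the
    //    stage finished and notify listeners.
    listener.onStageCompleted(StageCompleted(stageId = 3, finishTime = 100L))
    // 2. The newest TaskSetManager later finishes its own copy of the tasks,
    //    so the scheduler completes the same stage again with a new finish
    //    time -- a duplicate event for stage 3.
    listener.onStageCompleted(StageCompleted(stageId = 3, finishTime = 250L))
    listener
  }
}
```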
    
    @squito where has this behavior been argued against in the past?  My 
understanding is that a bunch of the scheduler code is based on the assumption 
that once some tasks in a stage fail with a FetchFailure, we ignore future 
successes from that stage, because doing so makes the code much simpler (it's 
also hard, in some cases, to know whether the successes are "real" or delayed 
messages from machines that later failed).  There was a bigger effort to fix 
that issue in [SPARK-14649](https://issues.apache.org/jira/browse/SPARK-14649), 
but there were a bunch of subtleties in getting it right, so for now effort 
on that has stopped.  If someone wants to restart that effort, it seems 
useful, but I think it should be decoupled from fixing this bug.

