GitHub user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Thanks a lot @squito for taking a look at it and for your feedback. 
    
    >> This is already true. When there is a fetch failure, the TaskSetManager is marked as zombie, and the DAGScheduler resubmits stages, but nothing actively kills running tasks.
    
    That is true, but currently the DAGScheduler has no idea which tasks are still running and which are being aborted. With this change, the TaskSetManager informs the DAGScheduler about the currently running/aborted tasks, so the DAGScheduler can avoid resubmitting duplicates. A rough sketch of the idea is below.
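
    To make the handshake concrete, here is a runnable toy sketch. It is not the actual patch: `stillRunning`, `missing`, and the filtering step are hypothetical stand-ins for what the zombie TaskSetManager would report and how the DAGScheduler would use it when resubmitting the stage.

    ```scala
    // Toy model of the proposed handshake, not the real patch. All names
    // and numbers are hypothetical stand-ins for Spark's internals.
    object ResubmitSketch {
      def main(args: Array[String]): Unit = {
        // Partitions of stage 1 attempt 0 that the (now zombie)
        // TaskSetManager reports as still having live task attempts
        // when the fetch failure arrives.
        val stillRunning: Set[Int] = Set(3, 4, 5, 6, 7, 8, 9)

        // Partitions the DAGScheduler considers incomplete at the time
        // the stage is resubmitted.
        val missing: Seq[Int] = Seq(2, 3, 4, 5, 6, 7, 8, 9)

        // With the reported set, attempt 1 launches only partitions with
        // no live attempt, instead of duplicating tasks 3-9.
        val toResubmit = missing.filterNot(stillRunning.contains)
        println(s"Resubmitting partitions: $toResubmit") // List(2)
      }
    }
    ```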
    
    >> I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.
    
    Yes, that's true. I will update the PR description.
    
    
    >> So I think in (b) and (c), you are trying to avoid resubmitting tasks 3-9 on stage 1 attempt 1. The thing is, there is a strong reason to believe that the original version of those tasks will fail. Most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach, and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue may have been transient or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.
    
    In our case, we are observing that a transient issue on the shuffle service might cause only a few tasks to fail, while other reducers might not see the fetch failure because they either already fetched the data from that shuffle service or are yet to fetch it. Killing all the reducers in those cases wastes a lot of work, and as I mentioned above, we might also end up in a state where jobs make no progress at all under frequent fetch failures, because they will just flip-flop between the two stages. The toy numbers below illustrate the concern.
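
    This is only toy arithmetic, not Spark code; the 90s/60s numbers are made up to illustrate the livelock, not measured from any real workload:

    ```scala
    // Hypothetical scenario: a reduce wave needs 90s of uninterrupted
    // work, and a transient shuffle-service hiccup triggers one fetch
    // failure every 60s.
    object FlipFlopSketch {
      def main(args: Array[String]): Unit = {
        val waveDuration    = 90 // seconds a reduce wave needs to complete
        val failureInterval = 60 // seconds between transient fetch failures

        // Kill-all policy: every failure discards all in-flight reduce
        // work, so the wave finishes only if it fits inside one
        // failure-free window -- here it never does.
        val killAllFinishes = failureInterval >= waveDuration
        println(s"kill all reducers on failure -> job finishes: $killAllFinishes") // false

        // Let-run policy: reducers that already fetched (or have yet to
        // fetch) their data from the flaky shuffle service are unaffected
        // and finish, so the job keeps making progress.
        println("let running reducers continue -> unaffected tasks complete")
      }
    }
    ```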


