Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17297

Thanks a lot @squito for taking a look at it and for your feedback.

>> this is already true. when there is a fetch failure, the TaskSetManager is marked as zombie, and the DAGScheduler resubmits stages, but nothing actively kills running tasks.

That is true, but currently the DAGScheduler has no idea which tasks are running and which are being aborted. With this change, the TaskSetManager informs the DAGScheduler about currently running/aborted tasks so that the DAGScheduler can avoid resubmitting duplicates.

>> I don't think its true that it relaunches all tasks that hadn't completed when the fetch failure occurred. it relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.

Yes, that's true. I will update the PR description.

>> So I think in (b) and (c), you are trying to avoid resubmitting tasks 3-9 on stage 1 attempt 1. The thing is, there is a strong reason to believe that the original version of those tasks will fail. Most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach, and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue may have been transient or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.

In our case, we are observing that a transient issue on the shuffle service can cause a few tasks to fail, while other reducers might not see the fetch failure at all, because they either already fetched the data from that shuffle service or have yet to fetch it.
Killing all the reducers in those cases wastes a lot of work, and, as I mentioned above, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they will just flip-flop between the two stages.
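To make the duplicate-avoidance idea concrete, here is a rough sketch (a hypothetical helper, not Spark's actual scheduler code): once the TaskSetManager reports which tasks from the zombie attempt are still running, the DAGScheduler could resubmit only the partitions that have no live copy.

```python
# Illustrative sketch only -- names and structure are assumptions, not Spark's API.
# On stage resubmission after a fetch failure, skip partitions that still have
# a running task from a previous (zombie) attempt instead of duplicating them.

def partitions_to_resubmit(missing_partitions, running_partitions):
    """Return partitions needing a fresh task: output is missing and
    no copy of the task is currently running in an earlier attempt."""
    return sorted(set(missing_partitions) - set(running_partitions))

# Example matching the scenario above: partitions 0-9 have no output yet,
# and tasks 3-9 from stage 1 attempt 0 are still running.
missing = range(10)
still_running = [3, 4, 5, 6, 7, 8, 9]
print(partitions_to_resubmit(missing, still_running))  # -> [0, 1, 2]
```

If one of the still-running tasks later fails (e.g. it also needed map output from the bad executor), it simply becomes missing again and is picked up by the next resubmission, so no work is lost by not killing it eagerly.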