wangshengjie created SPARK-37580:
------------------------------------

             Summary: Optimize current TaskSetManager abort logic when task failed count reach the threshold
                 Key: SPARK-37580
                 URL: https://issues.apache.org/jira/browse/SPARK-37580
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: wangshengjie
In our production environment, we found a flaw in the TaskSetManager abort logic. For example: suppose one task has failed 3 times (the default max failure threshold is 4), and it has both a retry attempt and a speculative attempt in the running state. One of these 2 attempts succeeds and tries to cancel the other. But if the executor running the attempt to be cancelled is lost (an OOM in our situation), that attempt is marked as failed. When the TaskSetManager handles this failed attempt, the task has now failed 4 times, so it aborts the stage and the job fails.

I have created a patch for this bug and will send a pull request soon.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
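The failure-counting scenario in the issue can be sketched as follows. This is a hypothetical, simplified model for illustration only: `TaskSetManagerSketch` and its methods are made-up names, not Spark's actual `TaskSetManager` internals, and the `hasSucceeded` guard stands in for the kind of fix the reporter describes (not counting the failure of a leftover attempt once another attempt of the same task has already succeeded).

```java
// Hypothetical, simplified model of the failure-counting logic described in
// the issue; TaskSetManagerSketch is illustrative, not Spark's real class.
public class TaskSetManagerSketch {
    private final int maxTaskFailures; // 4 by default in Spark
    private int numFailures = 0;
    private boolean hasSucceeded = false; // some attempt of this task finished OK
    private boolean aborted = false;

    public TaskSetManagerSketch(int maxTaskFailures) {
        this.maxTaskFailures = maxTaskFailures;
    }

    public void taskSucceeded() {
        hasSucceeded = true;
    }

    // Called when an attempt fails, e.g. its executor is lost to an OOM.
    public void taskFailed() {
        // Proposed guard: once an attempt of this task has succeeded, the
        // failure of a leftover attempt that was being cancelled should not
        // count toward the abort threshold. Without this check, the scenario
        // in the issue (3 failures + success + lost executor) reaches the
        // threshold and aborts the stage.
        if (hasSucceeded) {
            return;
        }
        numFailures++;
        if (numFailures >= maxTaskFailures) {
            aborted = true;
        }
    }

    public boolean isAborted() {
        return aborted;
    }

    public static void main(String[] args) {
        TaskSetManagerSketch tsm = new TaskSetManagerSketch(4);
        for (int i = 0; i < 3; i++) tsm.taskFailed(); // three genuine failures
        tsm.taskSucceeded();  // a retry or speculative attempt wins
        tsm.taskFailed();     // the cancelled attempt's executor is lost
        System.out.println("aborted = " + tsm.isAborted()); // prints "aborted = false"
    }
}
```

With the guard in place, the final failure of the attempt being cancelled is ignored and the stage survives; remove the `hasSucceeded` check and the same sequence of events would abort the stage, which is the behaviour the issue reports.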