wangshengjie created SPARK-37580:
------------------------------------

             Summary: Optimize current TaskSetManager abort logic when task failed count reach the threshold
                 Key: SPARK-37580
                 URL: https://issues.apache.org/jira/browse/SPARK-37580
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: wangshengjie
In our production environment, we found a flaw in the TaskSetManager abort logic. For example: suppose one task has failed 3 times (the default max failure threshold is 4), and it has both a retry attempt and a speculative attempt in the running state. One of these 2 attempts succeeds and tries to cancel the other. But if the executor running the attempt to be cancelled is lost (an OOM in our situation), that attempt is marked as failed. When the TaskSetManager handles this failed attempt, the task has now failed 4 times, so it aborts the stage and the job fails.

I have created a patch for this bug and will send a pull request soon.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
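The failure-counting scenario in the issue can be sketched as follows. This is a hypothetical, simplified model for illustration only: `TaskSetManagerSketch` and its methods are made-up names, not Spark's actual `TaskSetManager` internals, and the `hasSucceeded` guard stands in for the kind of fix the reporter describes (not counting the failure of a leftover attempt once another attempt of the same task has already succeeded).

```java
// Hypothetical, simplified model of the failure-counting logic described in
// the issue; TaskSetManagerSketch is illustrative, not Spark's real class.
public class TaskSetManagerSketch {
    private final int maxTaskFailures; // 4 by default in Spark
    private int numFailures = 0;
    private boolean hasSucceeded = false; // some attempt of this task finished OK
    private boolean aborted = false;

    public TaskSetManagerSketch(int maxTaskFailures) {
        this.maxTaskFailures = maxTaskFailures;
    }

    public void taskSucceeded() {
        hasSucceeded = true;
    }

    // Called when an attempt fails, e.g. its executor is lost to an OOM.
    public void taskFailed() {
        // Proposed guard: once an attempt of this task has succeeded, the
        // failure of a leftover attempt that was being cancelled should not
        // count toward the abort threshold. Without this check, the scenario
        // in the issue (3 failures + success + lost executor) reaches the
        // threshold and aborts the stage.
        if (hasSucceeded) {
            return;
        }
        numFailures++;
        if (numFailures >= maxTaskFailures) {
            aborted = true;
        }
    }

    public boolean isAborted() {
        return aborted;
    }

    public static void main(String[] args) {
        TaskSetManagerSketch tsm = new TaskSetManagerSketch(4);
        for (int i = 0; i < 3; i++) tsm.taskFailed(); // three genuine failures
        tsm.taskSucceeded();  // a retry or speculative attempt wins
        tsm.taskFailed();     // the cancelled attempt's executor is lost
        System.out.println("aborted = " + tsm.isAborted()); // prints "aborted = false"
    }
}
```

With the guard in place, the final failure of the attempt being cancelled is ignored and the stage survives; remove the `hasSucceeded` check and the same sequence of events would abort the stage, which is the behaviour the issue reports.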