[ https://issues.apache.org/jira/browse/SPARK-50648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mars updated SPARK-50648:
-------------------------
    Description:
Assume a job is stage1 -> stage2. When a FetchFailed occurs during stage2, stage1 and stage2 are resubmitted (stage2 may still have some tasks running even after it is resubmitted; this is expected, for the reason see the comment in https://issues.apache.org/jira/browse/SPARK-2666).

But if the SQL is canceled during the execution of the stage1 retry, the tasks in stage1 and the stage1 retry are all killed, while the tasks that were previously running in stage2 keep running and cannot be killed. These tasks occupy resources and can greatly affect cluster stability.

  was:
Assume a job is stage1 -> stage2. When a FetchFailed occurs during stage2, stage1 and stage2 will retry and resubmit (but stage2 may still have some tasks running). In this case, the parent stage will be retried first, but during the execution of the parent stage, the SQL statement will be canceled, then

> When the job is cancelled during shuffle retry, there are still zombie tasks
> that continue to run
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-50648
>                 URL: https://issues.apache.org/jira/browse/SPARK-50648
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Mars
>            Priority: Major
>
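The sequence of events in the description can be sketched as a toy model. This is not Spark's scheduler code: `ToyScheduler`, `Task`, and every method name below are hypothetical, and the two rules marked ASSUMPTION are chosen only to reproduce the reported symptom. They are not a confirmed root cause.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    stage: str            # e.g. "stage2"
    attempt: int          # stage attempt number
    running: bool = True


@dataclass
class ToyScheduler:
    """Toy model of the reported scenario; not Spark's DAGScheduler."""
    tasks: list = field(default_factory=list)
    active_stages: set = field(default_factory=set)

    def submit(self, stage, attempt, n_tasks):
        self.active_stages.add(stage)
        self.tasks += [Task(stage, attempt) for _ in range(n_tasks)]

    def finish(self, stage):
        # All tasks of the stage complete normally.
        for t in self.tasks:
            if t.stage == stage:
                t.running = False
        self.active_stages.discard(stage)

    def fetch_failed(self, stage, parent):
        # One task of `stage` fails with FetchFailed; its siblings keep running.
        failed = next(t for t in self.tasks if t.stage == stage and t.running)
        failed.running = False
        # ASSUMPTION: the failed stage leaves the active set while its
        # already-launched tasks are left alone.
        self.active_stages.discard(stage)
        # The parent stage is retried first, with a new attempt number.
        attempt = max(t.attempt for t in self.tasks if t.stage == parent) + 1
        self.submit(parent, attempt, n_tasks=2)

    def cancel_job(self):
        # ASSUMPTION: cancellation only kills tasks of stages in the active set.
        for t in self.tasks:
            if t.stage in self.active_stages:
                t.running = False
        self.active_stages.clear()

    def zombies(self):
        return [t for t in self.tasks if t.running]


def simulate():
    s = ToyScheduler()
    s.submit("stage1", 0, n_tasks=2)
    s.finish("stage1")                       # map stage completes
    s.submit("stage2", 0, n_tasks=3)
    s.fetch_failed("stage2", parent="stage1")
    s.cancel_job()                           # SQL canceled during the stage1 retry
    return s
```

Calling `simulate().zombies()` returns the stage2 tasks that were still running when the job was canceled, which matches the symptom described in the issue.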