[ https://issues.apache.org/jira/browse/SPARK-53145 ]
Zhen Wang updated SPARK-53145:
------------------------------
Description:

Duplicate of https://issues.apache.org/jira/browse/SPARK-49472: the tasks are re-run many times, but in my case the decommission was triggered by DRA (dynamic resource allocation).

!task.png!

Related configurations:
{code:java}
spark.decommission.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true{code}

Related logs:
{code:java}
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
{code}
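For context, here is a minimal Scala sketch of a shuffle-heavy job running under the configurations above. The job name and the final two spark.storage.decommission.* settings are assumptions on my part: block migration is offered only as a possible mitigation for the lost map output, not as a confirmed fix for this issue, and its interaction with the external shuffle service on YARN is not verified here.
{code:scala}
import org.apache.spark.sql.SparkSession

// A sketch, not a confirmed reproduction: a stage whose ShuffleMapTask output
// could be resubmitted if the executor holding it is decommissioned by DRA
// before downstream stages finish reading it.
object DecommissionReruns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("decommission-dra-sketch") // hypothetical name
      // Configuration from this report:
      .config("spark.decommission.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.enabled", "true")
      // Assumption (possible mitigation, untested for this report): migrate
      // blocks off the executor before it exits instead of dropping them.
      .config("spark.storage.decommission.enabled", "true")
      .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
      .getOrCreate()

    // A wide shuffle so that losing one executor's map output forces the
    // DAGScheduler to resubmit ShuffleMapTasks, as in the logs above.
    val counts = spark.sparkContext
      .parallelize(1 to 1000000, 200)
      .map(i => (i % 1000, 1L))
      .reduceByKey(_ + _)
    println(counts.count())
    spark.stop()
  }
}{code}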
> Task rerun caused by executor decommission triggered by DRA
> -----------------------------------------------------------
>
>                 Key: SPARK-53145
>                 URL: https://issues.apache.org/jira/browse/SPARK-53145
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Zhen Wang
>            Priority: Major