[ https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Imran Rashid updated SPARK-22074: --------------------------------- Component/s: Scheduler > Task killed by other attempt task should not be resubmitted > ----------------------------------------------------------- > > Key: SPARK-22074 > URL: https://issues.apache.org/jira/browse/SPARK-22074 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core > Affects Versions: 2.1.0, 2.2.0 > Reporter: Li Yuanjian > Labels: speculation > > When a task killed by other task attempt, the task still resubmitted while > its executor lost. There is a certain probability caused the stage hanging > forever because of the unnecessary resubmit(see the scenario description > below). Although the patch https://issues.apache.org/jira/browse/SPARK-13931 > can resolve the hanging problem(thx [~GavinGavinNo1] :) ), but the > unnecessary resubmit should abandon. > Detail scenario description: > 1. A ShuffleMapStage has many tasks, some of them finished successfully > 2. An Executor Lost happened, this will trigger a new TaskSet resubmitted, > includes all missing partitions. > 3. Before the resubmitted TaskSet completed, another executor which only > include the task killed by other attempt lost, trigger the Resubmitted Event, > current stage's pendingPartitions is not empty. > 4. Resubmitted TaskSet end, shuffleMapStage.isAvailable == true, but > pendingPartitions is not empty, never step into submitWaitingChildStages. > Leave the key logs of this scenario below: > {noformat} > 393332:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO DAGScheduler: > Submitting 120 missing tasks from ShuffleMapStage 1046 > (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116) > 393333:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO > YarnClusterScheduler: Adding task set 1046.0 with 120 tasks > 408766:17/09/11 13:46:25 [dispatcher-event-loop-5] INFO TaskSetManager: > Starting task 66.0 in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, > executor 15, partition 66, PROCESS_LOCAL, 6237 bytes) > [1] Executor 15 lost, task 66.0 and 90.0 on it > 410532:17/09/11 13:46:32 [dispatcher-event-loop-47] INFO > YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 15. > 410900:17/09/11 13:46:33 [dispatcher-event-loop-34] INFO TaskSetManager: > Starting task 66.1 in stage 1046.0 (TID 111400, hidden-baidu-host.baidu.com, > executor 70, partition 66, PROCESS_LOCAL, 6237 bytes) > [2] Task 66.0 killed by 66.1 > 411315:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Killing > attempt 0 for task 66.0 in stage 1046.0 (TID 110761) on > hidden-baidu-host.baidu.com as the attempt 1 succeeded on > hidden-baidu-host.baidu.com > 411316:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Finished > task 66.1 in stage 1046.0 (TID 111400) in 3545 ms on > hidden-baidu-host.baidu.com (executor 70) (115/120) > [3] Executor 7 lost, task 0.0 72.0 7.0 on it > 411390:17/09/11 13:46:37 [dispatcher-event-loop-24] INFO > YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7. > 416014:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: > ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) finished in 94.577 s > [4] ShuffleMapStage 1046.0 finished, missing partition trigger resubmitted > 1046.1 > 416019:17/09/1 13:46:59 [dag-scheduler-event- oop] INFO DAGScheduler: > Resubmitting ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) because some of > its tasks had failed: 0, 72, 79 > 416020:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: > Submitting ShuffleMapStage 1046 (MapPartitionsRDD[5321] at rdd at > AFDEntry.scala:116), which has no missing parents > 416030:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: > Submitting 3 missing tasks from ShuffleMapStage 1046 (MapPartitionsRDD[5321] > at rdd at AFDEntry.scala:116) > 416032:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO > YarnClusterScheduler: Adding task set 1046.1 with 3 tasks > 416034:17/09/11 13:46:59 [dispatcher-event-loop-21] INFO TaskSetManager: > Starting task 0.0 in stage 1046.1 (TID 112788, hidden-baidu-host.baidu.com, > executor 37, partition 0, PROCESS_LOCAL, 6237 bytes) > 416037:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: > Starting task 1.0 in stage 1046.1 (TID 112789, > yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com, executor 69, partition > 72, PROCESS_LOCAL, 6237 bytes) > 416039:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: > Starting task 2.0 in stage 1046.1 (TID 112790, hidden-baidu-host.baidu.com, > executor 26, partition 79, PROCESS_LOCAL, 6237 bytes) > [5] ShuffleMapStage 1046.1 still running, the attempted task killed by other > trigger the Resubmitted event > 416646:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost > task 66.0 in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, executor > 15): ExecutorLostFailure (executor 15 exited caused by one of the running > tasks) Reason: Container marked as failed: > container_1502719603300_158941_01_104857616 on host: > hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container > released on a *lost* node > 416647:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: > Resubmitted ShuffleMapTask(1046, 66), so marking it as still running > 416648:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost > task 90.0 in stage 1046.0 (TID 110788, hidden-baidu-host.baidu.com, executor > 15): ExecutorLostFailure (executor 15 exited caused by one of the running > tasks) Reason: Container marked as failed: > container_1502719603300_158941_01_104857616 on host: > hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container > released on a *lost* node > 416649:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: > Resubmitted ShuffleMapTask(1046, 90), so marking it as still running > 417197:17/09/11 13:47:02 [task-result-getter-0] INFO TaskSetManager: Finished > task 0.0 in stage 1046.1 (TID 112788) in 3027 ms on > hidden-baidu-host.baidu.com (executor 37) (1/3) > 417206:17/09/11 13:47:02 [task-result-getter-1] INFO TaskSetManager: Finished > task 1.0 in stage 1046.1 (TID 112789) in 3106 ms on > yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com (executor 69) (2/3) > 417383:17/09/11 13:47:03 [task-result-getter-0] INFO TaskSetManager: Finished > task 2.0 in stage 1046.1 (TID 112790) in 3634 ms on > hidden-baidu-host.baidu.com (executor 26) (3/3) > 417384:17/09/11 13:47:03 [task-result-getter-0] INFO YarnClusterScheduler: > Removed TaskSet 1046.1, whose tasks have all completed, from pool > [6] Task 1046.1 success, but 1046 stage forever in running queue > 417817:17/09/11 13:47:06 [task-result-getter-2] INFO YarnClusterScheduler: > Removed TaskSet 1046.0, whose tasks have all completed, from pool > 417872:17/09/11 13:47:06 [dag-scheduler-event-loop] INFO DAGScheduler: > running: Set(ShuffleMapStage 1090, ResultStage 1069, ShuffleMapStage 1070, > ShuffleMapStage 1113, ShuffleMapStage 1092, ShuffleMapStage 1063, > ShuffleMapStage 1086, ShuffleMapStage 1065, ShuffleMapStage 1109, > ShuffleMapStage 1088, ShuffleMapStage 1111, ShuffleMapStage 1082, ResultStage > 1104, ShuffleMapStage 1105, ShuffleMapStage 1084, ShuffleMapStage 1076, > ShuffleMapStage 1107, ShuffleMapStage 1099, ShuffleMapStage 1078, > ShuffleMapStage 1101, ShuffleMapStage 1080, ShuffleMapStage 1072, > ShuffleMapStage 1094, ResultStage 1103, ShuffleMapStage 1074, ResultStage > 1096, ShuffleMapStage 1067, ShuffleMapStage 1046, ShuffleMapStage 1097) > ...... > 598792:17/09/11 13:58:55 [dag-scheduler-event-loop] INFO DAGScheduler: > running: Set(ShuffleMapStage 1577, ShuffleMapStage 1548, ShuffleMapStage > 1571, ShuffleMapStage 1550, ShuffleMapStage 1543, ShuffleMapStage 1573, > ShuffleMapStage 1065, ShuffleMapStage 1565, ResultStage 1545, ShuffleMapStage > 1567, ShuffleMapStage 1546, ShuffleMapStage 1539, ShuffleMapStage 1569, > ShuffleMapStage 1561, ShuffleMapStage 1541, ShuffleMapStage 1533, > ShuffleMapStage 1563, ResultStage 1556, ShuffleMapStage 1535, ShuffleMapStage > 1557, ShuffleMapStage 1579, ShuffleMapStage 1537, ShuffleMapStage 1559, > ShuffleMapStage 1581, ShuffleMapStage 1552, ShuffleMapStage 1531, > ShuffleMapStage 1575, ShuffleMapStage 1554, ShuffleMapStage 1046) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org