Github user Ngone51 commented on the issue: https://github.com/apache/spark/pull/20930

Hi @xuanyuanking, thanks sincerely for your patient explanation. Regarding your latest explanation:

> stage 2's shuffleID is 1, but stage 3 failed by missing an output for shuffle '0'! So here the stage 2's skip cause stage 3 got an error shuffleId.

However, I don't think skipping stage 2 can lead stage 3 to get a wrong shuffleId, since we've already created all the `ShuffleDependency`s (constructed with fixed ids) for the `ShuffleMapStage`s before any stage of a job is submitted.

After struggling to understand this issue for a while, I finally arrived at my own inference. (Assume the two ShuffleMapTasks below belong to stage 2, and stage 2 has two partitions on the map side. Stage 2 has a parent stage, stage 1, and a child stage, stage 3.)

1. ShuffleMapTask 0.0 runs on ExecutorB, writes its map output on ExecutorB, and succeeds normally. At this point there is only 1 map output registered on `MapOutputTrackerMaster`.
2. ShuffleMapTask 1.0 runs on ExecutorA, fetches data from ExecutorA, and writes its map output on ExecutorA as well.
3. ExecutorA is lost for an unknown reason after sending the `StatusUpdate` message (reporting ShuffleMapTask 1.0's success) to the driver. All map outputs on ExecutorA are lost, including ShuffleMapTask 1.0's.
4. Before the driver receives that `StatusUpdate` message, it launches a speculative ShuffleMapTask 1.1, which gets a FetchFailed immediately.
5. `DAGScheduler` handles the FetchFailed ShuffleMapTask 1.1 first, marking stage 2 and its parent stage 1 as failed. Stage 1 and stage 2 now wait to be resubmitted.
6. `DAGScheduler` handles the successful ShuffleMapTask 1.0 before stage 1 and stage 2 are resubmitted, which triggers `MapOutputTrackerMaster.registerMapOutput`. Now there are 2 map outputs registered on `MapOutputTrackerMaster` (even though ShuffleMapTask 1.0's map output on ExecutorA has been lost).
7. Stage 1 is resubmitted and succeeds normally.
8. Stage 2 is resubmitted. Since stage 2 has 2 map outputs registered on `MapOutputTrackerMaster`, it has no missing partitions, and therefore no missing tasks to submit either.
9. We then submit stage 3. Because stage 2's map output file on ExecutorA is lost, stage 3 is bound to hit a FetchFailed in the end. We then resubmit stage 2 and stage 3, and get into a loop until stage 3 aborts.

But if the issue were what I described above, we should get a `FetchFailedException` rather than the `MetadataFetchFailedException` shown in the screenshot, so at that point my inference doesn't fully make sense. Please feel free to point out where I'm wrong. Anyway, thanks again.