[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580070#comment-17580070 ]
Min Shen commented on SPARK-40082:
----------------------------------

[~csingh] [~mridul]
I want to bring this ticket to your attention.
This seems to be an issue that we saw previously.
Does upstream already have the fix for this?

> DAGScheduler may not schedule new stage when push-based shuffle is enabled
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40082
>                 URL: https://issues.apache.org/jira/browse/SPARK-40082
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.1.1
>            Reporter: Penglei Shi
>            Priority: Major
>         Attachments: missParentStages.png, shuffleMergeFinalized.png,
>                      submitMissingTasks.png
>
>
> When push-based shuffle is enabled and speculative tasks exist, a
> shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages
> are resubmitted first, and they take some time to compute. Before the
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and
> register their map output, but the speculative task success events cannot
> trigger shuffleMergeFinalized because the stage has already been removed
> from runningStages.
> The stage is then resubmitted, but because the speculative tasks have
> already registered their map output, there are no missing tasks to compute,
> so the resubmission does not trigger shuffleMergeFinalized either.
> As a result, the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that
> hit the fetchFailed. In getMissingParentStages, this stage is marked as
> missing and is resubmitted, but the next stages are added to waitingStages
> only after this stage has finished, so the next stages are never submitted
> even though the resubmission of this stage has completed.
> I have only hit this a few times in my production environment, and it is
> difficult to reproduce.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
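
The sequence described in the quoted issue is easier to follow when replayed step by step. Below is a minimal sketch in Python, not Spark's DAGScheduler code: `ToyStage`, `ToyScheduler`, and `replay_ticket_scenario` are hypothetical names, and the rules they encode (success events finalize the merge only for stages in the running set, a resubmission with no missing tasks finishes without finalizing, a child is put in the waiting set only after its parents are resubmitted) are assumptions taken from the ticket's description rather than from the Spark source.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    """Hypothetical stand-in for a shuffle map stage (not Spark's class)."""
    name: str
    num_partitions: int
    registered_outputs: set = field(default_factory=set)
    shuffle_merge_finalized: bool = False

    def missing_partitions(self):
        return [p for p in range(self.num_partitions)
                if p not in self.registered_outputs]

    def is_available(self):
        # Assumption from the ticket: with push-based shuffle, the stage only
        # counts as available once the shuffle merge has been finalized.
        return not self.missing_partitions() and self.shuffle_merge_finalized


class ToyScheduler:
    """Toy model of the scheduling steps named in the ticket."""

    def __init__(self):
        self.running_stages = set()   # stage names
        self.waiting_stages = {}      # child name -> (child, parents)
        self.launched = []            # stages whose tasks were actually launched

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set while its parents recompute.
        self.running_stages.discard(stage.name)

    def on_task_success(self, stage, partition):
        stage.registered_outputs.add(partition)
        # Finalization is only triggered for stages that are still running.
        if stage.name in self.running_stages and not stage.missing_partitions():
            stage.shuffle_merge_finalized = True

    def submit_missing_tasks(self, stage):
        self.running_stages.add(stage.name)
        if not stage.missing_partitions():
            # Nothing to compute: the stage finishes at once, without
            # finalizing the shuffle merge.
            self.running_stages.discard(stage.name)
            self.submit_waiting_children(stage)
            return
        self.launched.append(stage.name)

    def submit_waiting_children(self, parent):
        ready = [name for name, (_, parents) in self.waiting_stages.items()
                 if any(p.name == parent.name for p in parents)]
        for name in ready:
            child, parents = self.waiting_stages.pop(name)
            self.submit_stage(child, parents)

    def submit_stage(self, stage, parents=()):
        missing = [p for p in parents if not p.is_available()]
        if not missing:
            self.submit_missing_tasks(stage)
            return
        for parent in missing:
            self.submit_stage(parent)
        # The child starts waiting only after the parents were resubmitted,
        # so a parent that finished immediately has already been missed.
        self.waiting_stages[stage.name] = (stage, list(parents))


def replay_ticket_scenario():
    sched = ToyScheduler()
    stage = ToyStage("map_stage", num_partitions=2)
    child = ToyStage("next_stage", num_partitions=1)

    sched.submit_stage(stage)        # first attempt is running
    sched.on_fetch_failed(stage)     # removed from running_stages
    sched.on_task_success(stage, 0)  # speculative tasks finish in the meantime
    sched.on_task_success(stage, 1)
    sched.submit_stage(stage)        # resubmission finds no missing tasks
    sched.submit_stage(child, parents=[stage])  # next stage submitted by AQE
    return sched, stage, child
```

Calling `replay_ticket_scenario()` in this model leaves `map_stage` with all outputs registered but `shuffle_merge_finalized` still false, and `next_stage` parked in `waiting_stages` with nothing left running to wake it up, which matches the hang the reporter describes.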