[ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580070#comment-17580070
 ] 

Min Shen commented on SPARK-40082:
----------------------------------

[~csingh] [~mridul] 

I want to bring this ticket to your attention. This looks like an issue that 
we saw previously. Does upstream already have a fix for it?

> DAGScheduler may not schedule new stage when push-based shuffle is enabled
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40082
>                 URL: https://issues.apache.org/jira/browse/SPARK-40082
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.1.1
>            Reporter: Penglei Shi
>            Priority: Major
>         Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> When push-based shuffle is enabled and speculative tasks exist, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages 
> are resubmitted first, and they take some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but the speculative task success events cannot 
> trigger shuffleMergeFinalized because the stage has already been removed 
> from runningStages.
> The stage is then resubmitted, but because the speculative tasks have 
> already registered map output, there are no missing tasks to compute, so the 
> resubmission does not trigger shuffleMergeFinalized either. As a result, the 
> stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that 
> hit the fetchFailed. In getMissingParentStages, this stage is marked as 
> missing and is resubmitted, but the next stages are added to waitingStages 
> only after this stage has already finished, so they are never submitted even 
> though the resubmission has completed.
> I have only hit this a few times in my production environment, and it is 
> difficult to reproduce.
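
The sequence described above can be sketched with a small toy model. This is
not Spark's DAGScheduler code: ToyStage, ToyScheduler, and simulate are
hypothetical names, and the rules in them (finalize only fires for a stage
still in the running set, a stage with no missing tasks launches nothing, a
child is parked in the waiting set after its parent is resubmitted) are
assumptions taken from the description, simplified to a single thread.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    """Hypothetical stand-in for a shuffle map stage (not a Spark class)."""
    name: str
    num_tasks: int
    outputs: set = field(default_factory=set)
    merge_finalized: bool = False

    def missing_tasks(self):
        return [t for t in range(self.num_tasks) if t not in self.outputs]

    def is_available(self):
        # Assumption: with push-based shuffle, a parent only counts as
        # available once all outputs exist AND the merge was finalized.
        return not self.missing_tasks() and self.merge_finalized


class ToyScheduler:
    """Hypothetical, single-threaded model of the described behavior."""

    def __init__(self):
        self.running = set()
        self.waiting = set()

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage.name)

    def on_task_success(self, stage, task):
        stage.outputs.add(task)
        # Finalize is only triggered for stages still tracked as running.
        if stage.name in self.running and not stage.missing_tasks():
            stage.merge_finalized = True
            self.running.discard(stage.name)

    def submit_stage(self, stage):
        # Returns True if tasks were launched. With nothing missing, no task
        # runs, so no later success event can trigger finalize.
        if stage.missing_tasks():
            self.running.add(stage.name)
            return True
        return False

    def submit_child(self, child_name, parent):
        if not parent.is_available():
            self.submit_stage(parent)     # parent "finishes" immediately
            self.waiting.add(child_name)  # parked after parent is done
            return False
        self.running.add(child_name)
        return True


def simulate():
    sched = ToyScheduler()
    stage = ToyStage("shuffleMapStage", num_tasks=2)
    sched.running.add(stage.name)       # stage running, speculation active
    sched.on_fetch_failed(stage)        # fetchFailed: removed from running
    sched.on_task_success(stage, 0)     # speculative tasks finish late
    sched.on_task_success(stage, 1)
    launched = sched.submit_stage(stage)              # resubmission
    child_started = sched.submit_child("nextStage", stage)
    return {
        "launched": launched,
        "merge_finalized": stage.merge_finalized,
        "child_started": child_started,
        "waiting": set(sched.waiting),
        "running": set(sched.running),
    }
```

In this model simulate() ends with nextStage parked in the waiting set and
nothing left running that could wake it up, which is the hang described in
the ticket. Whether the real scheduler follows exactly these rules would need
to be checked against the DAGScheduler source.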



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
