[ https://issues.apache.org/jira/browse/SPARK-45182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot reassigned SPARK-45182:
--------------------------------------

Assignee: Apache Spark

> Ignore task completion from old stage after retrying indeterminate stages
> -------------------------------------------------------------------------
>
>                 Key: SPARK-45182
>                 URL: https://issues.apache.org/jira/browse/SPARK-45182
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Mayur Bhosale
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: pull-request-available
>
> SPARK-25342 added support for rolling back a shuffle map stage so that all
> tasks of the stage can be retried when the stage output is indeterminate.
> This is done by clearing all map outputs at the time of stage submission.
> This approach works well except in the following case.
> Assume both Shuffle 1 and Shuffle 2 are indeterminate:
> ShuffleMapStage1 ----> Shuffle 1 ----> ShuffleMapStage2 ----> Shuffle 2 ----> ResultStage
> * ShuffleMapStage1 is complete
> * A task from ShuffleMapStage2 fails with FetchFailed. Other tasks are still running
> * Both ShuffleMapStage1 and ShuffleMapStage2 are retried
> * ShuffleMapStage1 is retried and completes
> * The ShuffleMapStage2 reattempt is scheduled for execution
> * Before all tasks of the ShuffleMapStage2 reattempt can finish, one or more
> laggard tasks from the original attempt of ShuffleMapStage2 finish, and
> ShuffleMapStage2 also gets marked as complete
> * ResultStage gets scheduled and finishes
> Internally at Uber, we have been using the stage rollback functionality
> even for deterministic stages since Spark 2.4.3, to add fault tolerance
> against a server going down in the [remote shuffle service
> |https://github.com/uber/RemoteShuffleService], and we have faced this
> scenario quite often.
> Ideally, such laggard tasks should not be counted toward partition completion.
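A minimal sketch of the guard the issue asks for, in Python rather than Spark's actual Scala DAGScheduler, with hypothetical names (`Stage`, `handle_task_completion` are illustrations, not Spark APIs): tag each completion with the stage attempt that launched its task, and ignore completions from a stale attempt of an indeterminate stage instead of letting them mark partitions as done.

```python
# Hedged sketch, not Spark's real scheduler: a retried indeterminate stage
# must recompute every partition, so completions from older attempts are
# ignored rather than counted toward partition completion.
from dataclasses import dataclass, field

@dataclass
class Stage:
    num_partitions: int
    indeterminate: bool
    latest_attempt: int = 0
    pending: set = field(default_factory=set)

    def submit_attempt(self):
        # Submitting (or resubmitting) an indeterminate stage invalidates
        # all prior output: bump the attempt id and mark every partition
        # pending again.
        self.latest_attempt += 1
        self.pending = set(range(self.num_partitions))

def handle_task_completion(stage, partition, task_attempt):
    """Return True if the completion is counted, False if ignored."""
    if stage.indeterminate and task_attempt < stage.latest_attempt:
        # Laggard task from a rolled-back attempt: its output may depend
        # on stale parent shuffle data, so it must not complete the stage.
        return False
    stage.pending.discard(partition)
    return True
```

With this check, the laggard task from the original ShuffleMapStage2 attempt in the scenario above would be dropped, and the stage would only be marked complete once the reattempt's own tasks finish.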
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org