[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caican updated SPARK-40455: --------------------------- Description: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation ``` ``` {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code} was: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation ``` // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() ``` > Abort result stage directly when it failed caused by FetchFailed > ---------------------------------------------------------------- > > Key: SPARK-40455 > URL: https://issues.apache.org/jira/browse/SPARK-40455 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0, 3.1.2, 3.2.1, 3.3.0 > Reporter: caican > Priority: Major > > Here's a very serious bug: > When result stage failed caused by `FetchFailedException`, the previous > condition to determine whether result stage retries are allowed is > `numMissingPartitions < resultStage.numTasks`. > > If this condition holds on retry, but the other tasks in the current result > stage are not killed, when result stage was resubmit, it would got wrong > partitions to recalculation > ``` > > ``` > {code:java} > // DAGScheduler#submitMissingTasks > > // Figure out the indexes of partition ids to compute. > val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org