[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605295#comment-17605295 ]
Apache Spark commented on SPARK-40455:
--------------------------------------

User 'caican00' has created a pull request for this issue:
https://github.com/apache/spark/pull/37899

> Abort result stage directly when it fails due to FetchFailed
> -------------------------------------------------------------
>
>                 Key: SPARK-40455
>                 URL: https://issues.apache.org/jira/browse/SPARK-40455
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.1.2, 3.2.1, 3.3.0
>            Reporter: caican
>            Assignee: Apache Spark
>            Priority: Major
>
> This is a serious bug:
> When a result stage fails because of a FetchFailedException, the condition currently used to decide whether the result stage may be retried is
> {color:#ff0000}numMissingPartitions < resultStage.numTasks{color}.
>
> If this condition holds on retry, but the other tasks of the current result stage attempt are not killed, then when the result stage is resubmitted it computes the wrong set of partitions to recalculate:
> {code:java}
> // DAGScheduler#submitMissingTasks
>
> // Figure out the indexes of partition ids to compute.
> val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
> {code}
> Because the unkilled tasks can finish between the failure and the resubmission, the number of partitions recomputed on retry can be smaller than the number of partitions in the result stage that actually need recomputation.
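To make the failure mode concrete, here is a minimal, self-contained Scala sketch of the behavior described above. It is not Spark's actual implementation: the `finished` array is a hypothetical stand-in for the per-partition completion state that findMissingPartitions consults, and the task timeline in the comments is an assumed scenario.

{code:scala}
object MissingPartitionsSketch {
  // Model of a result stage's findMissingPartitions: the partitions
  // whose results have not yet been recorded for the active job.
  def findMissingPartitions(finished: Array[Boolean]): Seq[Int] =
    finished.indices.filterNot(finished(_))

  def main(args: Array[String]): Unit = {
    // A result stage with 4 partitions, none finished yet.
    val finished = Array.fill(4)(false)

    // Attempt 0: the task for partition 1 hits a FetchFailedException.
    // The stage is marked for resubmission, but the tasks for
    // partitions 0, 2 and 3 from the same attempt are left running.

    // Before the resubmission happens, the still-running tasks for
    // partitions 0 and 2 complete and record their results.
    finished(0) = true
    finished(2) = true

    // The resubmitted attempt only schedules the partitions still
    // marked missing -- here 1 and 3, even though the retry decision
    // was made while 4 partitions were outstanding.
    println(findMissingPartitions(finished)) // Vector(1, 3)
  }
}
{code}

In the real scheduler, this is why partitionsToCompute can diverge from the set of partitions the retry decision was based on, and why the linked pull request proposes aborting the result stage outright on FetchFailed rather than retrying it.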