mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1125790378
##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2479,4 +2479,14 @@ package object config {
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)
+
+  private[spark] val STAGE_MAX_ATTEMPTS =
+    ConfigBuilder("spark.stage.maxAttempts")
+      .doc("The max attempts for a stage, the spark job will be aborted if any of its stages is " +
+        "resubmitted multiple times beyond the limitation. The value should be no less " +
+        "than `spark.stage.maxConsecutiveAttempts` which defines the max attempts for " +
+        "fetch failures.")
+      .version("3.5.0")
+      .intConf
+      .createWithDefault(16)

Review Comment:
   Since this is a behavior change, let us make this an optional parameter and preserve the current behavior when it is not configured (or make the default `Int.MaxValue`). We can change it to a more restrictive default value in a future release. Given cascading stage retries, particularly due to `INDETERMINATE` stages, and deployments where decommissioning is not applicable (YARN, for example), this minimizes application failures.
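   As an illustrative sketch only (not the change as merged in the PR), the suggestion could look roughly like the following, assuming the existing `ConfigBuilder` API in `org.apache.spark.internal.config`:

   ```scala
   // Sketch only. Inside core/src/main/scala/org/apache/spark/internal/config/package.scala
   // ConfigBuilder is already in scope; elsewhere it would need
   // `import org.apache.spark.internal.config.ConfigBuilder`.

   // Option A: no default at all. The caller (presumably DAGScheduler) would skip the
   // limit check when the value is absent, which preserves today's behavior.
   private[spark] val STAGE_MAX_ATTEMPTS =
     ConfigBuilder("spark.stage.maxAttempts")
       .doc("The max attempts for a stage; the job is aborted once any of its stages is " +
         "resubmitted more times than this. Should be no less than " +
         "spark.stage.maxConsecutiveAttempts, which bounds retries due to fetch failures.")
       .version("3.5.0")
       .intConf
       .createOptional

   // Option B: keep an Int default but use Int.MaxValue for now, so existing jobs
   // are unaffected; a stricter default could land in a later release.
   //   .createWithDefault(Int.MaxValue)
   ```

   With Option A the scheduler treats "not configured" as "no limit"; Option B keeps a single code path but relies on a sentinel default.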