Spark has long supported a window-based executor failure-tracking mechanism on YARN; SPARK-41210 [1][2] (included in 3.5.0) extended this feature to K8s.
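To illustrate what "window-based" means here, below is a minimal Python sketch of the idea. It is not Spark's implementation: the class and method names are hypothetical, and the exact boundary handling is an assumption. As far as I recall, the corresponding settings in 3.5.0 are `spark.executor.maxNumFailures` and `spark.executor.failuresValidityInterval`; please verify the names and defaults against the configuration docs for your Spark version.

```python
from collections import deque


class ExecutorFailureWindow:
    """Hypothetical sketch: counts executor failures inside a sliding time window."""

    def __init__(self, max_failures, validity_interval_s=None):
        # max_failures: failures tolerated before the application should fail.
        # validity_interval_s: window length in seconds; None means failures never expire.
        self.max_failures = max_failures
        self.validity_interval_s = validity_interval_s
        self._failure_times = deque()

    def _expire(self, now_s):
        # Drop failures that are older than the validity interval.
        if self.validity_interval_s is None:
            return
        while (self._failure_times
               and now_s - self._failure_times[0] > self.validity_interval_s):
            self._failure_times.popleft()

    def record_failure(self, now_s):
        # Timestamps are expected in non-decreasing order.
        self._failure_times.append(now_s)

    def num_failures(self, now_s):
        self._expire(now_s)
        return len(self._failure_times)

    def should_fail_application(self, now_s):
        # Fail once the failures still inside the window exceed the limit.
        return self.num_failures(now_s) > self.max_failures


tracker = ExecutorFailureWindow(max_failures=2, validity_interval_s=60)
for t in (0, 10, 20):
    tracker.record_failure(t)

burst_is_fatal = tracker.should_fail_application(20)    # three failures within 60 s
later_is_fatal = tracker.should_fail_application(100)   # all three have aged out
```

With a window, a short burst of failures from a persistent problem stops the application, while occasional transient failures spread over a long-running job age out and do not accumulate toward the limit.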
[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732

Thanks,
Cheng Pan

> On Feb 19, 2024, at 23:59, Sri Potluri <pssp...@gmail.com> wrote:
>
> Hello Spark Community,
>
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator,
> for running various Spark applications. While the system generally works
> well, I've encountered a challenge related to how Spark applications handle
> executor failures, specifically in scenarios where executors enter an error
> state due to persistent issues.
>
> Problem Description
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it becomes
> problematic in cases where the failure is due to a persistent issue (such as
> misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> Desired Behavior
>
> Ideally, I would like to have a mechanism to limit the number of retries for
> executor recreation. If the system fails to successfully create an executor
> more than a specified number of times (e.g., 5 attempts), the entire Spark
> application should fail and stop trying to recreate the executor. This
> behavior would help in efficiently managing resources and avoiding prolonged
> failure states.
>
> Questions for the Community
>
> 1. Is there an existing configuration or method within Spark or the Spark
> Operator to limit executor recreation attempts and fail the job after
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or
> solutions that could be applied in this context?
>
> Additional Context
>
> I have explored Spark's task and stage retry configurations
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
> do not directly address the issue of limiting executor creation retries.
> Implementing a custom monitoring solution to track executor failures and
> manually stop the application is a potential workaround, but it would be
> preferable to have a more integrated solution.
>
> I appreciate any guidance, insights, or feedback you can provide on this
> matter.
>
> Thank you for your time and support.
>
> Best regards,
> Sri P