Hello Spark Community,

I am currently running Spark on Kubernetes, managed by the Spark Operator, for a variety of Spark applications. The setup generally works well, but I've run into a challenge with how Spark applications handle executor failures, specifically when executors keep entering an error state because of a persistent issue.
*Problem Description*
When an executor of a Spark application fails, the system maintains the desired level of parallelism by automatically creating a replacement executor. This behavior is helpful for transient errors, since the application keeps running, but it becomes problematic when the failure is persistent (e.g., a misconfiguration, an inaccessible external resource, or incompatible environment settings). In those cases the application enters a loop, continuously recreating executors that fail for the same reason, which wastes resources and complicates application management.

*Desired Behavior*
Ideally, there would be a mechanism to limit the number of executor recreation attempts. If the system fails to create a working executor more than a specified number of times (e.g., 5 attempts), the entire Spark application should fail rather than keep retrying. This would make resource usage more efficient and avoid prolonged failure states.

*Questions for the Community*
1. Is there an existing configuration or mechanism in Spark or the Spark Operator to limit executor recreation attempts and fail the job once a threshold is reached?
2. Has anyone else encountered a similar problem and found a workaround or solution that applies in this context?

*Additional Context*
I have explored Spark's task and stage retry configurations (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these do not directly limit executor creation retries. A custom monitoring solution that tracks executor failures and stops the application manually is a potential workaround (a rough sketch of what I have in mind is below), but a more integrated solution would be preferable.
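For concreteness, here is a minimal sketch of the driver-side workaround I'm considering: a `SparkListener` that counts executor removals and stops the application once a threshold is crossed. The class name `ExecutorFailureLimiter`, the default threshold, and the reason-string heuristic are my own assumptions, not existing Spark or Spark Operator settings.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved}

// Hypothetical workaround, not an existing Spark or Spark Operator feature:
// count executor removals on the driver and fail the whole application once
// a threshold is crossed, instead of letting executor pods be recreated
// indefinitely.
class ExecutorFailureLimiter(sc: SparkContext, maxFailures: Int = 5)
    extends SparkListener {

  private val failures = new AtomicInteger(0)

  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit = {
    // Heuristic: treat any removal that is not a graceful decommission as a
    // failure. Reason strings vary by cluster manager, so this match would
    // need tuning for a real deployment.
    if (!removed.reason.toLowerCase.contains("decommission")) {
      val count = failures.incrementAndGet()
      if (count >= maxFailures) {
        // Stop asynchronously so we don't block the listener bus thread
        // that is delivering this event.
        new Thread(() => sc.stop()).start()
      }
    }
  }
}
```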
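The listener could be registered from the driver with `sc.addSparkListener(new ExecutorFailureLimiter(sc, maxFailures = 5))` (registering via `spark.extraListeners` would require a zero-argument constructor). Calling `sc.stop()` from a listener is a blunt instrument, though, which is why I'd much prefer an integrated option if one exists.

I appreciate any guidance, insights, or feedback you can provide on this matter. Thank you for your time and support.

Best regards,
Sri P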