Thanks for your kind words, Sri.
Well, it is true that Spark on Kubernetes is not yet on par with Spark on
YARN in maturity; essentially, Spark on Kubernetes is still a work in
progress. So, in the first place, IMO one needs to think about why the
executors are failing. What causes this behaviour? Is it the
Spark has supported the window-based executor failure-tracking mechanism on
YARN for a long time; SPARK-41210 [1][2] (included in 3.5.0) extended this
feature to K8s.
[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732
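As a rough sketch of how this would be used, the submission below sets the failure-tracking settings on a K8s master. The config keys `spark.executor.maxNumFailures` and `spark.executor.failuresValidityInterval` are my reading of what SPARK-41210 introduced; the master URL, image, and threshold values are placeholders, so please verify the key names against the Spark 3.5+ configuration docs before relying on them.

```shell
# Sketch only: fail the application after 10 executor failures within any
# 10-minute window, instead of recreating executors indefinitely.
# Config names assumed from SPARK-41210; values are illustrative.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.maxNumFailures=10 \
  --conf spark.executor.failuresValidityInterval=10m \
  local:///opt/spark/app/your_job.py
```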
Thanks,
Cheng Pan
Dear Mich,
Thank you for your detailed response and the suggested approach to handling
retry logic. I appreciate you taking the time to outline the method of
embedding custom retry mechanisms directly into the application code.
While the solution of wrapping the main logic of the Spark job in a
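The retry approach discussed above could be sketched as a small driver-side wrapper. This is a minimal illustration, not any Spark API: `run_with_retries` and its parameters are hypothetical names, and the zero-argument `job` callable stands in for whatever main logic the Spark application runs.

```python
import time


def run_with_retries(job, max_attempts=3, backoff_seconds=30):
    """Run `job` (a zero-arg callable holding the Spark driver logic),
    retrying on failure. Hypothetical helper for illustration only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            print(f"Attempt {attempt} failed ({exc}); "
                  f"retrying in {backoff_seconds}s")
            time.sleep(backoff_seconds)
```

One drawback of this pattern, as noted in the thread, is that the retry policy lives inside the application code rather than in the cluster manager, so every job has to carry its own copy.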
I went through your issue with the code running on k8s.
When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically creating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, it can lead to executors being recreated indefinitely
when the underlying failure is persistent.
I am not aware of any configuration parameter in Spark classic to limit
executor creation; because of fault tolerance, Spark will try to recreate
failed executors. I am not really that familiar with the Spark Operator
for k8s; there may be something there.
Have you considered custom monitoring and alerting?
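One way to sketch such custom monitoring is to poll Spark's REST monitoring API, which exposes executor summaries per application. The endpoint path below comes from the Spark monitoring docs; the helper names, the threshold, and the alert action are hypothetical, and the `requests` package is assumed to be available.

```python
def count_lost_executors(executor_summaries):
    """Count non-driver executors reported as no longer active.
    `executor_summaries` is the parsed JSON list returned by
    /api/v1/applications/<app-id>/allexecutors."""
    return sum(
        1 for e in executor_summaries
        if e.get("id") != "driver" and not e.get("isActive", True)
    )


def check_executor_health(app_ui_url, app_id, threshold=5):
    # Illustrative only: fetch executor state from the Spark UI's REST API
    # and alert when too many executors have been lost.
    import requests  # assumption: installed in the monitoring environment
    resp = requests.get(
        f"{app_ui_url}/api/v1/applications/{app_id}/allexecutors")
    resp.raise_for_status()
    lost = count_lost_executors(resp.json())
    if lost >= threshold:
        print(f"ALERT: {lost} executors lost for application {app_id}")
    return lost
```

A check like this could run on a schedule next to the driver and feed whatever alerting or kill-and-resubmit logic fits your setup.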