As a carry-over from the apache-spark-on-k8s project, it would be useful to
have a configurable restart policy for submitted jobs when using the
Kubernetes resource manager. See the following issues:

https://github.com/apache-spark-on-k8s/spark/issues/133
https://github.com/apache-spark-on-k8s/spark/issues/288
https://github.com/apache-spark-on-k8s/spark/issues/546

Use case: I have a structured streaming job, deployed via k8s, that reads
from Kafka, aggregates, and writes back out to Kafka, checkpointing to a
remote location. If the driver pod dies for any number of reasons, it will
not restart.
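
For reference, the job is roughly shaped like the sketch below (topic names,
bootstrap servers, and the checkpoint path are simplified placeholders, not
our actual config):

    import org.apache.spark.sql.SparkSession

    object KafkaAggregateJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-aggregate")
          .getOrCreate()
        import spark.implicits._

        // Read from Kafka
        val input = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events-in")
          .load()

        // Aggregate
        val counts = input
          .selectExpr("CAST(value AS STRING) AS value")
          .groupBy($"value")
          .count()
          .selectExpr("value AS key", "CAST(count AS STRING) AS value")

        // Write back to Kafka, checkpointing to a remote location
        counts.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("topic", "counts-out")
          .option("checkpointLocation", "s3a://bucket/checkpoints/kafka-aggregate")
          .outputMode("update")
          .start()
          .awaitTermination()
      }
    }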

For us, since all state is stored via the checkpoint and we are satisfied
with at-least-once semantics, it would be useful if the driver came back up
on its own and picked back up where it left off.

Firstly, may we add this to JIRA? Secondly, is there any insight into the
current thinking around making this configurable in the future? If it's not
something that will happen natively, we will end up needing to write
something that polls or listens to k8s and is able to re-submit any failed
jobs, along the lines of the sketch below.
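
Roughly, that external workaround would look something like this (using the
fabric8 Kubernetes client; the namespace, label selector, and the resubmit
helper are placeholders for illustration only):

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}

    object DriverWatcher {
      def main(args: Array[String]): Unit = {
        val client = new DefaultKubernetesClient()
        // Watch driver pods by label and re-submit when one fails.
        client.pods()
          .inNamespace("spark")
          .withLabel("spark-role", "driver")
          .watch(new Watcher[Pod] {
            override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
              if (pod.getStatus.getPhase == "Failed") {
                // Hypothetical helper that would shell out to spark-submit
                // with the original arguments for this job.
                resubmit(pod.getMetadata.getName)
              }
            }
            override def onClose(cause: KubernetesClientException): Unit = ()
          })
      }

      def resubmit(driverName: String): Unit =
        println(s"would re-submit job for failed driver $driverName")
    }

It would obviously be preferable to have the resource manager handle this
natively rather than each of us maintaining a watcher like that.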

Thanks!

-- 

*Lucas Kacher*
Senior Engineer
-
vsco.co <https://www.vsco.co/>
New York, NY
818.512.5239