We discussed this early on in our fork, and I think we should file a JIRA for
it and discuss it further. It's something we want to address in the future.

One proposed approach is to run the driver in a StatefulSet of size 1. That
guarantees the driver pod is recreated if it dies, but it gives up the
completion semantics of a single pod: the StatefulSet will restart the driver
even after it has finished successfully.
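Roughly, a sketch of what that could look like, using the kubernetes Python
client (the image, labels, and driver command below are placeholders, not what
spark-submit actually generates today):

    # Sketch: wrap the driver in a StatefulSet with replicas=1 so Kubernetes
    # recreates the driver pod after a node loss or eviction. Trade-off: the
    # pod is also restarted after a successful run, so "job finished" can no
    # longer be read off the pod phase alone.
    from kubernetes import client, config

    config.load_kube_config()

    driver_container = client.V1Container(
        name="spark-kubernetes-driver",
        image="spark:latest",                             # placeholder image
        args=["driver", "--class", "org.example.Main"],   # placeholder command
    )

    statefulset = client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name="my-streaming-driver"),
        spec=client.V1StatefulSetSpec(
            replicas=1,
            service_name="my-streaming-driver",
            selector=client.V1LabelSelector(match_labels={"spark-role": "driver-ss"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"spark-role": "driver-ss"}),
                spec=client.V1PodSpec(
                    restart_policy="Always",   # StatefulSets only allow Always
                    containers=[driver_container],
                ),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_stateful_set(
        namespace="default", body=statefulset)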


See history in https://github.com/apache-spark-on-k8s/spark/issues/288

On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher <lu...@vsco.co> wrote:

> This is a carry-over from the apache-spark-on-k8s project: it would be useful
> to have a configurable restart policy for jobs submitted with the Kubernetes
> resource manager. See the following issues:
>
> https://github.com/apache-spark-on-k8s/spark/issues/133
> https://github.com/apache-spark-on-k8s/spark/issues/288
> https://github.com/apache-spark-on-k8s/spark/issues/546
>
> Use case: I have a structured streaming job that reads from Kafka,
> aggregates, and writes back out to Kafka. It is deployed via k8s and
> checkpoints to a remote location. If the driver pod dies for any number of
> reasons, it will not restart.
>
> For us, since all data is stored via checkpoint and we are satisfied with
> at-least-once semantics, it would be useful if the driver were to come back
> on its own and pick back up.
>
> Firstly, may we add this to JIRA? Secondly, is there any insight into the
> thinking around making this configurable in the future? If it's not something
> that will happen natively, we will end up needing to write something that
> polls or listens to k8s and has the ability to re-submit any failed jobs.
>
> Thanks!
>
> --
>
> *Lucas Kacher*
> Senior Engineer
> -
> vsco.co <https://www.vsco.co/>
> New York, NY
> 818.512.5239
>
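For completeness, the interim workaround described above (poll or watch k8s
for failed driver pods and re-submit) might look roughly like the sketch
below, using the kubernetes Python client; the namespace, label selector, and
spark-submit command are placeholders:

    # Sketch: watch driver pods in a namespace and re-run spark-submit whenever
    # one ends up in the Failed phase. The selector and command are placeholders.
    import subprocess

    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()

    SUBMIT_CMD = [
        "spark-submit",
        "--master", "k8s://https://my-cluster:6443",   # placeholder master URL
        "--deploy-mode", "cluster",
        "--class", "org.example.Main",                 # placeholder class/jar
        "local:///opt/app/my-app.jar",
    ]

    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod,
                          namespace="spark",
                          label_selector="spark-role=driver"):
        pod = event["object"]
        if pod.status.phase == "Failed":
            print(f"driver {pod.metadata.name} failed, re-submitting")
            subprocess.run(SUBMIT_CMD, check=True)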
