We discussed this early on in our fork, and I think we should track it in a JIRA and discuss it further. It's something we want to address in the future.
One proposed method is using a StatefulSet of size 1 for the driver. This ensures recovery, but at the cost of the completion semantics of a single pod: a StatefulSet will recreate the driver even after it exits successfully. See the history in https://github.com/apache-spark-on-k8s/spark/issues/288. Minimal sketches of both approaches follow at the end of this message.

On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher <lu...@vsco.co> wrote:

> A carry-over from the apache-spark-on-k8s project, it would be useful to
> have a configurable restart policy for jobs submitted with the Kubernetes
> resource manager. See the following issues:
>
> https://github.com/apache-spark-on-k8s/spark/issues/133
> https://github.com/apache-spark-on-k8s/spark/issues/288
> https://github.com/apache-spark-on-k8s/spark/issues/546
>
> Use case: I have a structured streaming job that reads from Kafka,
> aggregates, and writes back out to Kafka, deployed via k8s and
> checkpointing to a remote location. If the driver pod dies for any number
> of reasons, it will not restart.
>
> For us, as all data is stored via checkpoint and we are satisfied with
> at-least-once semantics, it would be useful if the driver were to come
> back on its own and pick back up.
>
> Firstly, may we add this to JIRA? Secondly, is there any insight as to
> what the thought is around allowing that to be configurable in the
> future? If it's not something that will happen natively, we will end up
> needing to write something that polls or listens to k8s and has the
> ability to re-submit any failed jobs.
>
> Thanks!
>
> --
>
> *Lucas Kacher*
> Senior Engineer
> -
> vsco.co <https://www.vsco.co/>
> New York, NY
> 818.512.5239
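For illustration, here is a minimal sketch of the StatefulSet-of-one idea using the official Kubernetes Python client. The image name, labels, and namespace are hypothetical placeholders; the relevant parts are replicas=1 and restartPolicy: Always (the only policy a StatefulSet allows), which is what makes Kubernetes recreate the driver pod after a failure, and also what breaks single-pod completion semantics:

    # Sketch only: names, labels, and image are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()

    driver = client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name="spark-driver"),
        spec=client.V1StatefulSetSpec(
            replicas=1,  # exactly one driver; recreated on pod/node failure
            service_name="spark-driver",
            selector=client.V1LabelSelector(match_labels={"app": "spark-driver"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "spark-driver"}),
                spec=client.V1PodSpec(
                    restart_policy="Always",  # StatefulSets require Always
                    containers=[
                        client.V1Container(
                            name="driver",
                            image="my-registry/spark-streaming-job:latest",  # hypothetical
                        )
                    ],
                ),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=driver)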
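And a sketch of the external watcher fallback Lucas mentions, again with the Python client. It assumes driver pods can be selected by a spark-role=driver label and that resubmission shells out to the original spark-submit invocation (arguments elided here, since they are job-specific):

    # Sketch only: label selector and spark-submit arguments are assumptions.
    import subprocess
    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Stream pod events and resubmit whenever a driver pod reaches Failed.
    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod,
                          namespace="default",
                          label_selector="spark-role=driver"):
        pod = event["object"]
        if pod.status.phase == "Failed":
            print("driver %s failed; resubmitting" % pod.metadata.name)
            # Re-run the original spark-submit invocation (arguments elided).
            subprocess.run(["spark-submit", "--master", "k8s://..."], check=False)

Either way this lives outside Spark itself, which is why a native, configurable restart policy would be the cleaner answer.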