This would be useful to us, so I've created a JIRA ticket for this discussion: https://issues.apache.org/jira/browse/SPARK-24122
On Wed, Mar 28, 2018 at 10:28 AM, Anirudh Ramanathan <ramanath...@google.com.invalid> wrote:

> We discussed this early on in our fork, and I think we should have this
> in a JIRA and discuss it further. It's something we want to address in
> the future.
>
> One proposed method is using a StatefulSet of size 1 for the driver. This
> ensures recovery, but at the same time takes away from the completion
> semantics of a single pod.
>
> See history in https://github.com/apache-spark-on-k8s/spark/issues/288
>
> On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher <lu...@vsco.co> wrote:
>
>> A carry-over from the apache-spark-on-k8s project, it would be useful
>> to have a configurable restart policy for submitted jobs with the
>> Kubernetes resource manager. See the following issues:
>>
>> https://github.com/apache-spark-on-k8s/spark/issues/133
>> https://github.com/apache-spark-on-k8s/spark/issues/288
>> https://github.com/apache-spark-on-k8s/spark/issues/546
>>
>> Use case: I have a structured streaming job that reads from Kafka,
>> aggregates, and writes back out to Kafka, deployed via k8s and
>> checkpointing to a remote location. If the driver pod dies for any
>> number of reasons, it will not restart.
>>
>> For us, as all data is stored via checkpoint and we are satisfied with
>> at-least-once semantics, it would be useful if the driver were to come
>> back on its own and pick back up.
>>
>> Firstly, may we add this to JIRA? Secondly, is there any insight as to
>> what the thought is around allowing that to be configurable in the
>> future? If it's not something that will happen natively, we will end up
>> needing to write something that polls or listens to k8s and has the
>> ability to re-submit any failed jobs.
>>
>> Thanks!
>>
>> --
>> Lucas Kacher
>> Senior Engineer
>> vsco.co <https://www.vsco.co/>
>> New York, NY
>> 818.512.5239
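For anyone following along, the StatefulSet-of-size-1 idea Anirudh mentions could look roughly like the sketch below, written against the official kubernetes Python client. This is a minimal illustration of the approach, not what spark-submit actually generates; the image, names, labels, and namespace are all assumptions.

```python
# Sketch: wrap the Spark driver in a StatefulSet of size 1 so Kubernetes
# recreates the driver pod after a failure. Names/images/labels are
# illustrative, not Spark's real submission output.
from kubernetes import client, config

config.load_kube_config()

driver_container = client.V1Container(
    name="spark-driver",
    image="my-registry/spark-streaming-job:latest",  # assumed image
)

stateful_set = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="spark-driver"),
    spec=client.V1StatefulSetSpec(
        service_name="spark-driver",
        replicas=1,  # exactly one driver; k8s recreates the pod if it dies
        selector=client.V1LabelSelector(match_labels={"app": "spark-driver"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "spark-driver"}),
            spec=client.V1PodSpec(
                restart_policy="Always",  # StatefulSets only allow Always
                containers=[driver_container],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_stateful_set(
    namespace="spark", body=stateful_set
)
```

The trade-off Anirudh notes is visible here: a StatefulSet pod template must use restartPolicy Always, so the driver is recreated even after a clean exit, which loses the run-to-completion semantics of a bare driver pod.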
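And a rough sketch of the external fallback Lucas describes (something that polls or listens to k8s and re-submits failed jobs), again with the kubernetes Python client. The spark-role=driver label is what the Spark 2.3 k8s backend puts on driver pods, but the namespace, master URL, and submit command below are assumptions, and a real version would need de-duplication and backoff.

```python
# Sketch: watch driver pods and re-run spark-submit when one fails.
import subprocess
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# Illustrative resubmission command for this job.
RESUBMIT_CMD = ["spark-submit", "--master", "k8s://https://my-cluster:6443",
                "--deploy-mode", "cluster", "my-job.jar"]

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace="spark",
                      label_selector="spark-role=driver"):
    pod = event["object"]
    if pod.status.phase == "Failed":
        # Driver died: clean up the dead pod and resubmit the job.
        # A production version needs de-duplication and backoff here.
        v1.delete_namespaced_pod(pod.metadata.name, "spark")
        subprocess.run(RESUBMIT_CMD, check=True)
```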