Another option would be to modify the operator so that, on first deploy, it
can optionally (behind a flag or parameter) find the most recent snapshot and
use it as the initialSavepointPath from which to restore and run the Flink
job.
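
To make the idea concrete, here is a rough sketch in Python of the selection
logic, applied to a FlinkDeployment manifest held as a plain dict. It is not
operator code. The helper names (`latest_snapshot`, `with_initial_savepoint`)
and the (path, last-modified) tuples are made up for illustration, and how the
snapshots get listed from object storage is left out entirely. The only real
field it touches is `spec.job.initialSavepointPath`.

```python
import copy
from typing import Iterable, Optional, Tuple

# (path, last-modified epoch seconds). Listing these from object storage is
# out of scope here; the shape of this tuple is an assumption of the sketch.
Snapshot = Tuple[str, float]


def latest_snapshot(snapshots: Iterable[Snapshot]) -> Optional[str]:
    """Return the path of the most recently modified snapshot, or None."""
    newest = max(snapshots, key=lambda s: s[1], default=None)
    return newest[0] if newest else None


def with_initial_savepoint(deployment: dict, snapshots: Iterable[Snapshot]) -> dict:
    """Return a copy of the manifest that restores from the newest snapshot.

    The copy is unchanged if no snapshot is found or if
    initialSavepointPath is already set explicitly.
    """
    patched = copy.deepcopy(deployment)
    job = patched.setdefault("spec", {}).setdefault("job", {})
    path = latest_snapshot(snapshots)
    if path and not job.get("initialSavepointPath"):
        job["initialSavepointPath"] = path
    return patched
```

An explicitly set initialSavepointPath wins over the discovered one, so the
behaviour stays opt-in per deployment and never overrides a path someone chose
by hand.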

On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam <kevin....@shopify.com> wrote:

> Hi there,
>
> We use the Flink Kubernetes Operator, and I am investigating how we can
> easily support failing over a FlinkDeployment from one Kubernetes Cluster
> to another in the case of an outage that requires us to migrate a large
> number of FlinkDeployments from one K8s cluster to another.
>
> I understand one way to do this is to set `initialSavepointPath` on all the
> FlinkDeployments to the most recent/appropriate snapshot so the jobs
> continue from where they left off, but for a large number of jobs, this
> would be quite a bit of manual labor.
>
> Do others have an approach they are using? Any advice?
>
> Could this be something addressed in a future FLIP? Perhaps we could store
> some kind of metadata in object storage so that the Flink Kubernetes
> Operator can restore a FlinkDeployment from where it left off, even if the
> job is shifted to another Kubernetes Cluster.
>
> Looking forward to hearing folks' thoughts!
>
