Hi there,

We use the Flink Kubernetes Operator, and I am investigating how we can
easily fail over FlinkDeployments from one Kubernetes cluster to another
when an outage requires us to migrate a large number of them at once.

I understand one way to do this is to set `spec.job.initialSavepointPath`
on each FlinkDeployment to the most recent (or otherwise appropriate)
snapshot so that the jobs continue from where they left off, but for a
large number of jobs this would be quite a bit of manual labor.
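
To make the manual step concrete, here is a rough sketch of how it might be
scripted. It assumes the FlinkDeployment manifests have already been
exported as plain dicts (for example, parsed from YAML) and that a mapping
from deployment name to latest snapshot path has been built some other way,
such as by listing object storage. The helper names and that mapping are
hypothetical; only the `spec.job.initialSavepointPath` field comes from the
operator's CRD.

```python
import copy

# Server-populated metadata that should not be carried over to a new cluster.
CLUSTER_SPECIFIC_METADATA = (
    "resourceVersion",
    "uid",
    "creationTimestamp",
    "generation",
    "managedFields",
)


def with_initial_savepoint(deployment: dict, savepoint_path: str) -> dict:
    """Return a copy of a FlinkDeployment manifest that restores from savepoint_path."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched

    # Drop state that only makes sense on the original cluster.
    patched.pop("status", None)
    metadata = patched.setdefault("metadata", {})
    for key in CLUSTER_SPECIFIC_METADATA:
        metadata.pop(key, None)

    # Point the job at the snapshot it should resume from.
    job = patched.setdefault("spec", {}).setdefault("job", {})
    job["initialSavepointPath"] = savepoint_path
    return patched


def patch_all(deployments: list, latest_snapshot: dict) -> tuple:
    """Patch every manifest that has a known snapshot.

    latest_snapshot is a hypothetical mapping: deployment name -> snapshot path.
    Returns (patched_manifests, names_without_a_snapshot).
    """
    patched, missing = [], []
    for deployment in deployments:
        name = deployment["metadata"]["name"]
        path = latest_snapshot.get(name)
        if path is None:
            missing.append(name)  # surface these for manual follow-up
            continue
        patched.append(with_initial_savepoint(deployment, path))
    return patched, missing
```

The patched manifests could then be applied to the target cluster with
whatever tooling is already in place. This only automates the bookkeeping;
working out which snapshot is the right one for each job is still the hard
part.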

Do others have an approach they are using? Any advice?

Could this be something addressed in a future FLIP? Perhaps we could store
some kind of metadata in object storage so that the Flink Kubernetes
Operator can restore a FlinkDeployment from where it left off, even if the
job is shifted to another Kubernetes Cluster.

Looking forward to hearing folks' thoughts!
