Hi Max,

It feels a bit hacky to need to back up the resources directly from the
cluster, as opposed to being able to redeploy our checked-in k8s manifests
such that they fail over correctly, but that makes sense to me and we can
look into this approach. Thanks for the suggestion!
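For reference, a dry-run sketch of the handover flow being discussed (dump the resources from the old cluster, re-apply them on the new one). This just builds the kubectl command strings; the resource kinds, context names, and namespace are placeholders, not anything the operator requires:

```python
# Dry-run sketch of the cluster handover: build the kubectl commands that
# would dump the relevant resources from the old cluster and re-create them
# on the new one. Kinds, contexts, and namespace below are placeholders.
RESOURCE_KINDS = ["flinkdeployments", "configmaps", "secrets"]

def export_cmd(context: str, namespace: str) -> str:
    """kubectl invocation that dumps the resources as YAML on stdout."""
    return (f"kubectl --context {context} -n {namespace} "
            f"get {','.join(RESOURCE_KINDS)} -o yaml")

def import_cmd(context: str, namespace: str, dump_file: str) -> str:
    """kubectl invocation that re-creates the dumped resources."""
    return f"kubectl --context {context} -n {namespace} apply -f {dump_file}"

print(export_cmd("old-cluster", "flink"))
print(import_cmd("new-cluster", "flink", "resources-dump.yaml"))
```

The actual shutdown of the old cluster in between is site-specific and omitted here.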

I'd still be interested in hearing the community's thoughts on whether we
can support this in a more first-class way as part of the Apache Flink
Kubernetes Operator.

Thanks,
Kevin

On Wed, Mar 13, 2024 at 9:41 AM Maximilian Michels <m...@apache.org> wrote:

> Hi Kevin,
>
> Theoretically, as long as you move over all k8s resources, failover
> should work fine on the Flink and Flink Operator side. The tricky part
> is the handover. You will need to back up all resources from the old
> cluster, shut down the old cluster, then re-create them on the new
> cluster. The operator deployment and the Flink cluster should then
> recover fine (assuming that high availability has been configured and
> checkpointing is done to persistent storage available in the new
> cluster). The operator state / Flink state is actually kept in
> ConfigMaps which would be part of the resource dump.
>
> This method has proven to work in the case of Kubernetes cluster upgrades.
> Moving to an entirely new cluster is a bit more involved but exporting
> all resource definitions and re-importing them into the new cluster
> should yield the same result as long as the checkpoint paths do not
> change.
>
> Probably something worth trying :)
>
> -Max
>
>
>
> On Wed, Mar 6, 2024 at 9:09 PM Kevin Lam <kevin....@shopify.com.invalid>
> wrote:
> >
> > Another thought could be modifying the operator so that, on first
> > deploy, it optionally (behind a flag/param) finds the most recent
> > snapshot and uses that as the initialSavepointPath to restore and run
> > the Flink job.
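A rough sketch of the selection logic such a flag might implement: pick the newest checkpoint under a job's state path and use it as the restore point. The chk-&lt;N&gt; directory naming follows Flink's default checkpoint layout; the listing itself would come from an object-store client and is stubbed out here as a plain list:

```python
# Hypothetical sketch: choose the most recent checkpoint path from a listing
# of a job's checkpoint directory, to feed into initialSavepointPath.
# Assumes Flink's default chk-<N> directory naming.
import re

def latest_restore_path(paths):
    """Return the path with the highest chk-N suffix, or None if none match."""
    best, best_n = None, -1
    for p in paths:
        m = re.search(r"/chk-(\d+)/?$", p)
        if m and int(m.group(1)) > best_n:
            best, best_n = p, int(m.group(1))
    return best

# Example listing (placeholder bucket/paths):
paths = [
    "s3://bucket/ckpts/job-1/chk-17",
    "s3://bucket/ckpts/job-1/chk-42",
    "s3://bucket/ckpts/job-1/chk-9",
]
print(latest_restore_path(paths))  # -> s3://bucket/ckpts/job-1/chk-42
```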
> >
> > On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam <kevin....@shopify.com> wrote:
> >
> > > Hi there,
> > >
> > > We use the Flink Kubernetes Operator, and I am investigating how we can
> > > easily support failing over a FlinkDeployment from one Kubernetes
> Cluster
> > > to another in the case of an outage that requires us to migrate a large
> > > number of FlinkDeployments from one K8s cluster to another.
> > >
> > > I understand one way to do this is to set `initialSavepointPath` on
> > > all the FlinkDeployments to the most recent/appropriate snapshot so
> > > the jobs continue from where they left off, but for a large number of
> > > jobs, this would be quite a bit of manual labor.
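For concreteness, the manual approach described above would mean setting the savepoint path in each FlinkDeployment spec, roughly like this (a minimal, incomplete sketch; the names and path are placeholders, while the field names come from the operator's FlinkDeployment CRD):

```yaml
# Minimal sketch, not a complete spec: restoring a redeployed job from a
# known snapshot via initialSavepointPath. Names and path are placeholders.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job
spec:
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar
    state: running
    upgradeMode: savepoint
    initialSavepointPath: s3://bucket/savepoints/my-job/savepoint-xxxx
```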
> > >
> > > Do others have an approach they are using? Any advice?
> > >
> > > Could this be something addressed in a future FLIP? Perhaps we could
> > > store some kind of metadata in object storage so that the Flink
> > > Kubernetes Operator can restore a FlinkDeployment from where it left
> > > off, even if the job is shifted to another Kubernetes Cluster.
> > >
> > > Looking forward to hearing folks' thoughts!
> > >
>
