Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-06 Thread Kevin Lam
Hi there,

We use the Flink Kubernetes Operator, and I am investigating how we can
easily support failing over a FlinkDeployment from one Kubernetes Cluster
to another in the case of an outage that requires us to migrate a large
number of FlinkDeployments from one K8s cluster to another.

I understand one way to do this is to set `initialSavepointPath` on all the
FlinkDeployments to the most recent/appropriate snapshot so the jobs
continue from where they left off, but for a large number of jobs, that
would be quite a bit of manual labor.
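A minimal sketch of automating that step, assuming savepoints land under a per-job prefix in object storage (the bucket layout, helper names, and the `s3://` scheme here are all assumptions, not anything the operator provides):

```python
# Hypothetical helper: given an object-storage listing for one job, pick
# the newest savepoint and render the spec patch to merge into that
# FlinkDeployment manifest. The bucket layout is an assumption.
from datetime import datetime

def latest_savepoint(objects: list[tuple[str, datetime]]) -> str:
    """Return the directory of the most recently written savepoint.

    `objects` is a list of (key, last_modified) pairs, e.g. from an
    S3/GCS listing of flink-state/<job>/savepoints/.
    """
    if not objects:
        raise ValueError("no savepoints found for job")
    key, _ = max(objects, key=lambda kv: kv[1])
    # Keep only the savepoint directory, not the individual file.
    return key.rsplit("/", 1)[0]

def initial_savepoint_patch(path: str) -> dict:
    """Render the FlinkDeployment spec fragment that restores from `path`."""
    return {"spec": {"job": {"initialSavepointPath": f"s3://{path}"}}}

objects = [
    ("flink-state/orders/savepoints/savepoint-aaa/_metadata", datetime(2024, 3, 1)),
    ("flink-state/orders/savepoints/savepoint-bbb/_metadata", datetime(2024, 3, 5)),
]
patch = initial_savepoint_patch(latest_savepoint(objects))
print(patch["spec"]["job"]["initialSavepointPath"])
# -> s3://flink-state/orders/savepoints/savepoint-bbb
```

Running something like this over every checked-in manifest before redeploying would remove most of the manual labor, but it is only a sketch of the idea.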

Do others have an approach they are using? Any advice?

Could this be something addressed in a future FLIP? Perhaps we could store
some kind of metadata in object storage so that the Flink Kubernetes
Operator can restore a FlinkDeployment from where it left off, even if the
job is shifted to another Kubernetes Cluster.

Looking forward to hearing folks' thoughts!


Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-06 Thread Kevin Lam
Another thought: the operator could be modified so that, on first deploy,
it optionally (enabled via a flag/param) finds the most recent snapshot and
uses it as the `initialSavepointPath` to restore and run the Flink job.



Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Maximilian Michels
Hi Kevin,

Theoretically, as long as you move over all k8s resources, failover
should work fine on the Flink and Flink Operator side. The tricky part
is the handover. You will need to back up all resources from the old
cluster, shut down the old cluster, then re-create them on the new
cluster. The operator deployment and the Flink cluster should then
recover fine (assuming that high availability has been configured and
checkpointing is done to persistent storage available in the new
cluster). The operator state / Flink state is actually kept in
ConfigMaps, which would be part of the resource dump.

This method has proven to work in case of Kubernetes cluster upgrades.
Moving to an entirely new cluster is a bit more involved but exporting
all resource definitions and re-importing them into the new cluster
should yield the same result as long as the checkpoint paths do not
change.
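The export/re-import procedure Max describes could be sketched roughly as follows (the namespace, the set of resource kinds, and the list of server-managed fields to strip are assumptions based on his description, not an official procedure):

```python
# Sketch of the backup/re-create handover. Note the HA/operator-state
# ConfigMaps are included in the dump, as Max points out.
KINDS = ["flinkdeployments", "configmaps", "secrets", "services"]

def export_cmd(namespace: str) -> list[str]:
    """kubectl command that dumps the resources to back up from the old cluster."""
    return ["kubectl", "get", ",".join(KINDS), "-n", namespace, "-o", "yaml"]

def strip_server_fields(resource: dict) -> dict:
    """Drop fields the old API server owns, so the dump applies cleanly
    on the new cluster."""
    meta = dict(resource.get("metadata", {}))
    for field in ("uid", "resourceVersion", "creationTimestamp", "managedFields"):
        meta.pop(field, None)
    return {**resource, "metadata": meta}

print(" ".join(export_cmd("flink-jobs")))
# Then: shut down the old cluster, clean each resource in the dump with
# strip_server_fields, and `kubectl apply -f` it on the new cluster
# (checkpoint paths must stay reachable and unchanged).
```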

Probably something worth trying :)

-Max





Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Kevin Lam
Hi Max,

It feels a bit hacky to need to back up the resources directly from the
cluster, as opposed to being able to redeploy our checked-in k8s manifests
such that they fail over correctly, but that makes sense to me and we can
look into this approach. Thanks for the suggestion!

I'd still be interested in hearing the community's thoughts on if we can
support this in a more first-class way as part of the Apache Flink
Kubernetes Operator.

Thanks,
Kevin



Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Gyula Fóra
Hey Kevin!

The general mismatch I see here is that operators and resources are pretty
cluster-dependent. The operator itself runs in the same cluster, so it
feels out of scope to submit resources to different clusters; that doesn't
really sound like something any Kubernetes operator should do in general.

To me this sounds more like a typical control plane feature that sits above
different environments and operator instances. There are a lot of features
like this, blue/green deployments also fall into this category in my head,
but there are of course many many others.

There may come a time when the Flink community decides to take on such a
scope but it feels a bit too much at this point to try to standardize this.

Cheers,
Gyula



Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-14 Thread Kevin Lam
Thanks for your response, Gyula. Yes, I understand; it doesn't really fit
nicely into the Kubernetes operator pattern.

I do still wonder about the idea of supporting a feature where, upon first
deploy, the Flink Operator optionally (flag/param enabled) finds the most
recent snapshot (in a specified object storage URI) and uses that as the
initialSavepointPath to restore and run the Flink job. It doesn't require
being aware of clusters, or submitting resources to different clusters at
all, while still facilitating such failovers.




Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-20 Thread Gyula Fóra
Sorry for the late reply, Kevin.

I think what you are suggesting makes sense; it would basically be a
`last-state` startup mode. This would also help in cases where the current
last-state mechanism fails to locate the HA metadata (and the state).

This is somewhat of a tricky feature to implement:
 1. The operator will need FS plugins and access to the different user envs
(this will not work in many prod environments, unfortunately)
 2. Flink doesn't expose a good way to detect the latest checkpoint just by
looking at the FS, so we need to figure out something here. Some changes
are probably necessary on the Flink core side as well

Gyula


Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-21 Thread Kevin Lam
No worries, thanks for the reply Gyula.

Ah yes, I see how those points you raised make the feature tricky to
implement.
Could this be considered for a FLIP (or two) in the future?



Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-22 Thread Gyula Fóra
I agree, we would need some FLIPs to cover this. Actually, there is already
some work on this topic initiated by Matthias Pohl (cc'd).
Please see this:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore

This FLIP actually covers some of these limitations already and other
outstanding issues in the operator.

Cheers,
Gyula


Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-22 Thread Kevin Lam
Thanks for sharing this work Gyula! That's great to see the FLIP covers
some of the limitations already. I will follow the FLIP and associated JIRA
ticket.

Hi Matthias Pohl. I'd be interested to learn if there has been any progress
on FLIP-360 or the associated JIRA issue FLINK-31709.
