Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-22 Thread Kevin Lam
Thanks for sharing this work, Gyula! It's great to see that the FLIP already covers some of the limitations. I will follow the FLIP and the associated JIRA ticket. Hi Matthias Pohl, I'd be interested to learn if there has been any progress on FLIP-360 or the associated JIRA issue FLINK-31709. On Fri,

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-22 Thread Gyula Fóra
I agree, we would need some FLIPs to cover this. Actually, there is already some work on this topic initiated by Matthias Pohl (cc'd). Please see this:

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-21 Thread Kevin Lam
No worries, thanks for the reply Gyula. Ah yes, I see how those points you raised make the feature tricky to implement. Could this be considered for a FLIP (or two) in the future? On Wed, Mar 20, 2024 at 2:21 PM Gyula Fóra wrote: > Sorry for the late reply Kevin. > > I think what you are

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-20 Thread Gyula Fóra
Sorry for the late reply, Kevin. I think what you are suggesting makes sense; it would basically be a `last-state` startup mode. This would also help in cases where the current last-state mechanism fails to locate HA metadata (and the state). This is a somewhat tricky feature to implement: 1.

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-14 Thread Kevin Lam
Thanks for your response Gyula. Yes I understand, it doesn't really fit nicely into the Kubernetes Operator pattern. I do still wonder about the idea of supporting a feature where upon first deploy, Flink Operator optionally (flag/param enabled) finds the most recent snapshot (in a specified

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Gyula Fóra
Hey Kevin! The general mismatch I see here is that operators and resources are pretty cluster dependent. The operator itself is running in the same cluster, so it feels out of scope to submit resources to different clusters; this doesn't really sound like what any Kubernetes Operator should do in

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Kevin Lam
Hi Max, It feels a bit hacky to need to back up the resources directly from the cluster, as opposed to being able to redeploy our checked-in k8s manifests such that they fail over correctly, but that makes sense to me and we can look into this approach. Thanks for the suggestion! I'd still be

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Maximilian Michels
Hi Kevin, Theoretically, as long as you move over all k8s resources, failover should work fine on the Flink and Flink Operator side. The tricky part is the handover. You will need to back up all resources from the old cluster, shut down the old cluster, then re-create them on the new cluster. The

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-06 Thread Kevin Lam
Another thought could be modifying the operator to have a behaviour where upon first deploy, it optionally (flag/param enabled) finds the most recent snapshot and uses that as the initialSavepointPath to restore and run the Flink job. On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam wrote: > Hi there,
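
To make the idea above concrete, here is a minimal sketch of what "find the most recent snapshot and use it as the initialSavepointPath" could look like if done outside the operator, before a manifest is applied. It is not operator code. It assumes the candidate paths have already been listed from the checkpoint storage and that they follow Flink's `chk-<n>` checkpoint directory naming; the helper names `latest_checkpoint` and `with_initial_savepoint`, and the example bucket paths, are hypothetical.

```python
import re
from typing import Iterable, Optional

# Flink names retained checkpoint directories "chk-<n>"; a higher n is newer.
_CHK = re.compile(r"(?:^|/)chk-(\d+)/?$")


def latest_checkpoint(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the highest chk-<n> id, or None if none match.

    Hypothetical helper: listing the storage location is left to the caller.
    """
    best_id, best_path = -1, None
    for path in paths:
        match = _CHK.search(path)
        if match and int(match.group(1)) > best_id:
            best_id, best_path = int(match.group(1)), path
    return best_path


def with_initial_savepoint(job_spec: dict, path: Optional[str]) -> dict:
    """Return a copy of a FlinkDeployment job spec with initialSavepointPath set.

    If no snapshot was found, the spec is returned unchanged so the job
    starts from an empty state.
    """
    patched = dict(job_spec)
    if path is not None:
        patched["initialSavepointPath"] = path
    return patched


if __name__ == "__main__":
    # Example paths are made up for illustration.
    found = latest_checkpoint([
        "s3://bucket/job-a/chk-9",
        "s3://bucket/job-a/chk-12/",
        "s3://bucket/job-a/shared",
    ])
    print(with_initial_savepoint({"upgradeMode": "savepoint"}, found))
```

A real implementation would also need to check that the chosen checkpoint is complete, which this sketch does not attempt.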

Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-06 Thread Kevin Lam
Hi there, We use the Flink Kubernetes Operator, and I am investigating how we can easily support failing over a FlinkDeployment from one Kubernetes Cluster to another in the case of an outage that requires us to migrate a large number of FlinkDeployments from one K8s cluster to another. I