Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-06 Thread Tony Chen
So, I was able to get the rollback to work after I changed my upgradeMode to *last-state*. Previously, my upgradeMode was *savepoint*, and when I deployed a bad commit, the jobmanager-leader configmap would get deleted. Once I changed the upgradeMode to *last-state*, the configmap was retained when
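For reference, the change described above might look roughly like the following FlinkDeployment fragment. This is only a sketch under assumptions: the resource name is taken from the log line quoted elsewhere in this thread, the storage path and jar location are hypothetical placeholders, and the HA and rollback keys are the ones given in the operator documentation rather than the poster's actual manifest.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-testing-service            # name as it appears in the thread's log output
spec:
  flinkConfiguration:
    # last-state upgrades rely on Kubernetes HA metadata being available
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://example-bucket/flink-ha   # hypothetical path
    # experimental operator rollback feature referenced in this thread
    kubernetes.operator.deployment.rollback.enabled: "true"
  job:
    jarURI: local:///opt/flink/usrlib/app.jar                    # hypothetical jar location
    upgradeMode: last-state              # previously: savepoint
```

With upgradeMode set to last-state, the operator restores from the checkpoint information held in the HA metadata, which is why retaining the configmap matters for a rollback.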

Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-05 Thread Gyula Fóra
Hi Tony! There are still a few corner cases when the operator cannot upgrade / rollback deployments due to the loss of HA metadata (and with that checkpoint information). Most of these issues are not related to the operator logic directly but to how Flink handles certain failures and are related

Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-05 Thread Tony Chen
I tried this out with operator version 1.4 and it didn't work for me. I noticed that when I was deploying a bad version, the Kubernetes HA metadata and configmaps were deleted: 2023-10-05 14:52:17,493 o.a.f.k.o.l.AuditUtils [INFO ][flink-testing-service/flink-testing-service]

Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-05 Thread Tony Chen
I just saw this experimental feature in the documentation: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental I'm guessing this is the only way to automate rollbacks for now. On Wed, Oct 4, 2023 at

Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-04 Thread Tony Chen
Hi Flink Community, I am currently running the Apache flink-kubernetes-operator on our Kubernetes clusters, and I have Flink applications that are deployed using FlinkDeployment custom resources (CRs). I am trying to automate the process of rollbacks and I am running into some issues. I was testin