[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora closed FLINK-32012.
------------------------------
    Fix Version/s: kubernetes-operator-1.6.0
       Resolution: Fixed

merged to main d346ca9c437d20042ed8f4a1954f0f0ed438b3ae

> Operator failed to rollback due to missing HA metadata
> ------------------------------------------------------
>
>                 Key: FLINK-32012
>                 URL: https://issues.apache.org/jira/browse/FLINK-32012
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.4.0
>            Reporter: Nicolas Fraison
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.6.0
>
>
> The operator correctly detected that the job was failing and initiated the
> rollback, but the rollback failed with `Rollback is not possible due to
> missing HA metadata`.
> We are relying on savepoint upgrade mode and ZooKeeper HA.
> The operator performs a set of actions that also delete this HA data in
> savepoint upgrade mode:
> * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]: Suspend job with savepoint and deleteClusterDeployment
> * [flink-kubernetes-operator/StandaloneFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]: Remove JM + TM deployment and delete HA data
> * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]: Wait for cluster shutdown and delete ZooKeeper HA data
> * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]: Remove all child znodes
> Then, when running the rollback, the operator looks for HA data even though
> we rely on savepoint upgrade mode:
> * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]: Reconcile the rollback if a rollback is required
> * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]: Rollback fails because HA data is not available
> * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]: Check whether any child znodes are available
> For both steps the pattern looks the same for Kubernetes HA, so it does not
> appear to be linked to a ZooKeeper bug.
>
> From https://issues.apache.org/jira/browse/FLINK-30305 it appears to be
> expected that the HA data has been deleted (Flink does the same when relying
> on savepoint upgrade mode).
> Still, the use case seems to differ from
> https://issues.apache.org/jira/browse/FLINK-30305, as the operator is aware of
> the failure and handles a specific rollback event.
> So I'm wondering why we enforce such a check when performing a rollback if we
> rely on savepoint upgrade mode. Would it be fine to not rely on the HA data
> and roll back from the last savepoint (the one we used in the deployment step)?
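
For readers skimming the thread, the reporter's question can be illustrated with a small, language-neutral sketch in Python. It is not the operator's code and not the change merged for this issue. `UpgradeMode`, `RollbackContext`, and `plan_rollback` are hypothetical names introduced only for illustration. The sketch assumes the savepoint used at deployment time is still known when the rollback starts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UpgradeMode(Enum):
    # Hypothetical stand-in for the two upgrade modes discussed in the issue.
    SAVEPOINT = "savepoint"
    LAST_STATE = "last-state"


@dataclass
class RollbackContext:
    # Hypothetical snapshot of what is known when the rollback starts.
    upgrade_mode: UpgradeMode
    ha_metadata_available: bool
    last_savepoint_path: Optional[str]


def plan_rollback(ctx: RollbackContext) -> str:
    """Return where the rollback would restore from, or raise if impossible."""
    # Proposed behaviour: in savepoint upgrade mode, prefer the savepoint taken
    # during the deployment step and do not require HA metadata at all.
    if ctx.upgrade_mode is UpgradeMode.SAVEPOINT and ctx.last_savepoint_path:
        return f"restore-from-savepoint:{ctx.last_savepoint_path}"
    # Otherwise fall back to the check described in the report.
    if ctx.ha_metadata_available:
        return "restore-from-ha-metadata"
    raise RuntimeError("Rollback is not possible due to missing HA metadata")
```

With this rule, the reported situation (savepoint upgrade mode, HA data already deleted, savepoint known) yields a savepoint restore instead of the error. The error remains only when neither a savepoint nor HA metadata is available.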