Nicolas Fraison created FLINK-32012:
---------------------------------------

             Summary: Operator failed to rollback due to missing HA metadata
                 Key: FLINK-32012
                 URL: https://issues.apache.org/jira/browse/FLINK-32012
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.4.0
            Reporter: Nicolas Fraison


The operator correctly detected that the job was failing and initiated the 
rollback, but the rollback failed with `Rollback is not possible due to 
missing HA metadata`.

We are relying on the savepoint upgrade mode and ZooKeeper HA.

The operator performs a series of actions that also delete the HA data in 
savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
 : Suspends the job with a savepoint and calls deleteClusterDeployment

 * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
 : Removes the JM and TM deployments and deletes the HA data

 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
 : Waits for cluster shutdown and deletes the ZooKeeper HA data

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
 : Removes all child znodes
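The suspend-and-clean-up sequence above can be sketched as follows. This is a minimal, hypothetical model for illustration only, not the operator's actual API: the class, method, savepoint path, and znode names are all invented, and the ZooKeeper tree is modeled as a plain map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the savepoint-mode suspend described above:
// cancel with savepoint, delete the JM/TM deployments, then remove all
// child znodes (the HA metadata) under the job's HA path.
public class SavepointSuspendSketch {

    // haStore models the ZooKeeper tree: HA path -> child znode names.
    static String suspendWithSavepoint(Map<String, List<String>> haStore, String haPath) {
        // 1. Cancel the job with a savepoint (path is illustrative).
        String savepointPath = "s3://savepoints/savepoint-123";
        // 2. The JM and TM deployments would be deleted here.
        // 3. Delete all child znodes, i.e. the HA metadata.
        haStore.getOrDefault(haPath, new ArrayList<>()).clear();
        return savepointPath;
    }

    public static void main(String[] args) {
        Map<String, List<String>> haStore = new HashMap<>();
        haStore.put("/flink/job", new ArrayList<>(List.of("checkpoint-counter", "leader")));
        String sp = suspendWithSavepoint(haStore, "/flink/job");
        System.out.println(sp != null);
        // After the suspend, no HA metadata is left behind.
        System.out.println(haStore.get("/flink/job").isEmpty());
    }
}
```

The point of the sketch is only that, after a savepoint-mode suspend completes, the HA path is empty by design.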

Then, when running the rollback, the operator looks for HA data even though 
we rely on the savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
 Performs the rollback reconciliation if a rollback is needed

 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
 Rollback fails because the HA data is not available

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
 Checks whether any child znodes exist
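The rollback gate described above can be modeled with a small sketch (hypothetical names and signatures, not the operator's real code): the rollback is refused whenever no HA metadata exists, and the configured upgrade mode is never consulted.

```java
import java.util.List;

// Hypothetical model of the rollback gate described above: HA metadata
// (child znodes for ZooKeeper HA, ConfigMaps for Kubernetes HA) is
// required unconditionally, even in savepoint upgrade mode.
public class RollbackGateSketch {
    enum UpgradeMode { SAVEPOINT, LAST_STATE, STATELESS }

    static boolean canRollback(UpgradeMode mode, List<String> haChildNodes) {
        // Current behavior as reported: the upgrade mode is ignored and
        // only the presence of HA metadata is checked.
        return !haChildNodes.isEmpty();
    }

    public static void main(String[] args) {
        // After a savepoint-mode suspend the HA data has been deleted,
        // so the rollback is refused even though a savepoint exists.
        System.out.println(canRollback(UpgradeMode.SAVEPOINT, List.of()));
    }
}
```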

For both steps the pattern looks the same for Kubernetes HA, so this does not 
appear to be specific to ZooKeeper.

 

From https://issues.apache.org/jira/browse/FLINK-30305 it appears to be 
expected that the HA data is deleted (the same cleanup is also performed by 
Flink itself when relying on the savepoint upgrade mode).

So I'm wondering why we enforce such a check when performing a rollback if we 
rely on the savepoint upgrade mode.

Would it be fine not to rely on the HA data and instead roll back from the 
last savepoint (the one used in the deployment step)?
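As a sketch of this proposal (again hypothetical, for illustration only): in savepoint upgrade mode the gate could fall back to the last savepoint recorded for the deployment instead of requiring HA metadata, while last-state mode keeps the existing requirement.

```java
import java.util.List;

// Hypothetical sketch of the proposed behavior: in savepoint upgrade mode,
// a recorded savepoint (the one used for the deployment) is accepted as
// the rollback source, so the missing HA metadata is no longer fatal.
public class ProposedRollbackGateSketch {
    enum UpgradeMode { SAVEPOINT, LAST_STATE, STATELESS }

    static boolean canRollback(UpgradeMode mode, List<String> haChildNodes, String lastSavepointPath) {
        if (mode == UpgradeMode.SAVEPOINT) {
            // Proposed: a known savepoint is sufficient; HA data is optional.
            return lastSavepointPath != null;
        }
        // Last-state mode still needs the HA metadata (checkpoint pointers).
        return !haChildNodes.isEmpty();
    }

    public static void main(String[] args) {
        // Savepoint mode: rollback allowed even with no HA data left.
        System.out.println(canRollback(UpgradeMode.SAVEPOINT, List.of(), "s3://savepoints/savepoint-123"));
        // Last-state mode: still refused without HA metadata.
        System.out.println(canRollback(UpgradeMode.LAST_STATE, List.of(), null));
    }
}
```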



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
