[ https://issues.apache.org/jira/browse/FLINK-33011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772350#comment-17772350 ]

Mason Chen commented on FLINK-33011:
------------------------------------

[~gyfora] could you backport this to 1.6? We are also hitting this bug in 1.6.

> Operator deletes HA data unexpectedly
> -------------------------------------
>
>                 Key: FLINK-33011
>                 URL: https://issues.apache.org/jira/browse/FLINK-33011
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.17.1, kubernetes-operator-1.6.0
>         Environment: Flink: 1.17.1
> Flink Kubernetes Operator: 1.6.0
>            Reporter: Ruibin Xing
>            Assignee: Gyula Fora
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.7.0
>
>         Attachments: flink_operator_logs_0831.csv
>
>
> We encountered a problem where the operator unexpectedly deleted HA data.
> The timeline is as follows:
> 12:08 We submitted the first spec, which suspended the job with savepoint upgrade mode.
> 12:08 The job was suspended, while the HA data was preserved, and the log showed the observed job deployment status was MISSING.
> 12:10 We submitted the second spec, which deployed the job with the last-state upgrade mode.
> 12:10 Logs showed the operator deleted both the Flink deployment and the HA data again.
> 12:10 The job failed to start because the HA data was missing.
> According to the log, the deletion was triggered by
> https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
> I think this would only be triggered if the job deployment status wasn't MISSING. But the log before the deletion showed the observed job status was MISSING at that moment.
> Related logs:
>
> {code:java}
> 2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
> 2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
> 2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils [INFO ][default/pipeline-pipeline-se-3] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
> 2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
> {code}
> A more complete log file is attached. Thanks.
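
To make the suspected condition concrete, below is a minimal, purely hypothetical Java sketch of the kind of guard described in the report. The enum values and method names are assumptions for illustration and are not taken from the actual ApplicationReconciler; the point is only that HA metadata deletion is expected to be skipped when the observed JobManager deployment status is MISSING.

{code:java}
// Hypothetical sketch only; NOT the actual flink-kubernetes-operator code.
// Illustrates the guard the report refers to: HA metadata should only be
// removed together with an existing JobManager deployment, never when the
// observed status is already MISSING (e.g. after a savepoint suspend).
enum JobManagerDeploymentStatus { READY, DEPLOYING, MISSING, ERROR } // illustrative values

class LastStateRedeploySketch {

    void redeploy(JobManagerDeploymentStatus observedStatus) {
        if (observedStatus == JobManagerDeploymentStatus.MISSING) {
            // Job already suspended: nothing to tear down, and the HA
            // metadata (checkpoint pointers) must be preserved so a
            // last-state upgrade can recover from it.
            return;
        }
        // A running deployment was observed: tear it down, which is the
        // only path that is expected to touch the HA metadata.
        deleteJobManagerDeploymentAndHaMetadata(); // hypothetical name
    }

    void deleteJobManagerDeploymentAndHaMetadata() {
        // placeholder for the actual teardown logic
    }
}
{code}

Under such a guard, the 12:10 "Deleting JobManager deployment and HA metadata" log line should not be reachable while the last observed status is MISSING, which is the contradiction the report highlights.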