[ 
https://issues.apache.org/jira/browse/FLINK-33011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772350#comment-17772350
 ] 

Mason Chen commented on FLINK-33011:
------------------------------------

[~gyfora] could you backport this to 1.6? We are also hitting this bug in 1.6

> Operator deletes HA data unexpectedly
> -------------------------------------
>
>                 Key: FLINK-33011
>                 URL: https://issues.apache.org/jira/browse/FLINK-33011
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.17.1, kubernetes-operator-1.6.0
>         Environment: Flink: 1.17.1
> Flink Kubernetes Operator: 1.6.0
>            Reporter: Ruibin Xing
>            Assignee: Gyula Fora
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.7.0
>
>         Attachments: flink_operator_logs_0831.csv
>
>
> We encountered a problem where the operator unexpectedly deleted HA data.
> The timeline is as follows:
> 12:08 We submitted the first spec, which suspended the job with savepoint 
> upgrade mode.
> 12:08 The job was suspended, while the HA data was preserved, and the log 
> showed the observed job deployment status was MISSING.
> 12:10 We submitted the second spec, which deployed the job with the last 
> state upgrade mode.
> 12:10 Logs showed the operator deleted both the Flink deployment and the HA 
> data again.
> 12:10 The job failed to start because the HA data was missing.
> According to the log, the deletion was triggered by 
> https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
> I think this would only be triggered if the job deployment status wasn't 
> MISSING. But the log before the deletion showed the observed job status was 
> MISSING at that moment.
> Related logs:
>  
> {code:java}
> 2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
> 2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO 
> ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous 
> status: MISSING
> 2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils         [INFO 
> ][default/pipeline-pipeline-se-3] >>> Event  | Info    | SPECCHANGED     | 
> UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : 
> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362
>  -> 
> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365,
>  podTemplate.metadata.labels.app.kubernetes.io~1version : 
> 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 
> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, 
> job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), 
> starting reconciliation.
> 2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO 
> ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA 
> metadata.
> {code}
> A more complete log file is attached. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to