Hi!

I think this issue is the same as https://issues.apache.org/jira/browse/FLINK-33011

I'm not sure what exactly the underlying cause is, as I could not reproduce it, but the fix should be simple.
I also believe it is not related to 1.6.0, unless a JOSDK/Fabric8 upgrade caused it.

Cheers,
Gyula

On Mon, Sep 11, 2023 at 7:47 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> You don’t need it, but you can really mess up clusters by rolling back CRD changes…
>
> On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:
>
>> Why do we need to use the latest CRD version with an older operator version?
>> ------------------------------
>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>> *Sent:* September 12, 2023, 0:36:26
>> *To:* Evgeniy Lyutikov
>> *Cc:* user@flink.apache.org
>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming from suspend
>>
>> Do not change the CRD, but I believe you can roll back the operator itself.
>>
>> Gyula
>>
>> On Mon, 11 Sep 2023 at 18:52, Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:
>>
>>> Is it safe to roll back the operator version and replace the CRDs with the old ones?
>>> ------------------------------
>>> *From:* Evgeniy Lyutikov <eblyuti...@avito.ru>
>>> *Sent:* September 11, 2023, 23:50:26
>>> *To:* Gyula Fóra
>>> *Cc:* user@flink.apache.org
>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming from suspend
>>>
>>> Hi!
>>> No, no one could have restarted the JobManager.
>>> I monitored the pods in real time, and they were all deleted on suspend, as expected.
>>>
>>> ------------------------------
>>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>>> *Sent:* September 11, 2023, 20:34:52
>>> *To:* Evgeniy Lyutikov
>>> *Cc:* user@flink.apache.org
>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming from suspend
>>>
>>> Hi!
>>>
>>> I could not reproduce your issue; last-state suspend/restore seems to work as before.
>>> However, these two log lines seem very suspicious:
>>>
>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
>>>
>>> It looks like somebody restarted the JobManager manually after the suspend (which deleted the JobManager Deployment). Is that possible?
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:
>>>
>>>> Hi all!
>>>> After updating the operator to version 1.6.0, suspending and resuming Flink jobs stopped working.
>>>> When a job resumes, the high availability metadata is removed.
>>>>
>>>> Suspend job:
>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.
>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][rec-job/rec-job] Job is in running state, ready for upgrade with LAST_STATE
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUSPENDED | Suspending existing deployment.
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment while preserving HA metadata.
>>>> 2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
>>>> 2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
>>>> 2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
>>>> 2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
>>>> 2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | SUSPENDED | The resource (job) has been suspended
>>>> 2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...
>>>>
>>>> Resume:
>>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
>>>> 2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | SUSPENDED | The resource (job) has been suspended
>>>> 2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][rec-job/rec-job] Deleting deployment with terminated application before new deployment
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
>>>> 2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting Kubernetes HA metadata
>>>> 2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUBMIT | Starting deployment
>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
>>>> 2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
>>>> 2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Warning | RESTOREFAILED | HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>>>> 2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{},"throwableList":[]}
>>>> 2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUBMIT | Starting deployment
>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
>>>> 2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
>>>>
>>>> ------------------------------
>>>> “This message contains confidential information/commercial secret. If you are not the intended addressee of this message you may not copy, save, print or forward it to any third party and you are kindly requested to destroy this message and notify the sender thereof by email.”
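
For anyone who wants to check their own operator logs for the same sequence, here is a minimal sketch in Python. It is only a heuristic and not part of the operator: the function name `find_ha_metadata_loss` and the regular expression are hypothetical, and the assumed line layout (timestamp, logger, level, `namespace/name`, message) is inferred from the log excerpts quoted in this thread. It reports a resource when "Deleting JobManager deployment and HA metadata." shows up after "Deleting JobManager deployment while preserving HA metadata." for that same resource.

```python
import re

# Messages copied verbatim from the operator log lines quoted in this thread.
PRESERVE_MSG = "Deleting JobManager deployment while preserving HA metadata."
DELETE_MSG = "Deleting JobManager deployment and HA metadata."

# Assumed layout: "<date> <time> <logger> [LEVEL][<namespace>/<name>] <message>".
# No start anchor, so quoted lines prefixed with ">>>> " are matched as well.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\S+\s+"
    r"\[\s*\w+\s*\]\[(?P<resource>[^\]]+)\]\s+(?P<msg>.*)$"
)


def find_ha_metadata_loss(lines):
    """Return (resource, preserve_ts, delete_ts) tuples for every resource
    whose HA metadata was deleted after a suspend that had preserved it."""
    preserved = {}  # resource -> timestamp of the last "preserving" message
    findings = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # not an operator log line in the assumed layout
        resource, msg, ts = match["resource"], match["msg"], match["ts"]
        if msg.startswith(PRESERVE_MSG):
            preserved[resource] = ts
        elif msg.startswith(DELETE_MSG) and resource in preserved:
            findings.append((resource, preserved.pop(resource), ts))
    return findings
```

Note that a later, intentional stateless redeploy of a previously suspended job would produce the same pair of messages, so a hit is only a pointer to where to look, not proof of the bug.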