Hi!

I think this issue is the same as
https://issues.apache.org/jira/browse/FLINK-33011
I'm not sure what exactly the underlying cause is, as I could not reproduce
it, but the fix should be simple.

Also, I believe it's not related to 1.6.0, unless a JOSDK/Fabric8 upgrade
caused it.

Cheers,
Gyula


On Mon, Sep 11, 2023 at 7:47 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> You don’t need it but you can really mess up clusters by rolling back CRD
> changes…
>
> On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov <eblyuti...@avito.ru>
> wrote:
>
>> Why do we need to use the latest CRD version with an older operator version?
>> ------------------------------
>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>> *Sent:* September 12, 2023, 0:36:26
>>
>> *To:* Evgeniy Lyutikov
>> *Cc:* user@flink.apache.org
>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>> from suspend
>>
>> Do not change the CRD, but I believe you can roll back the operator itself.
>>
>> Gyula
>>
>> On Mon, 11 Sep 2023 at 18:52, Evgeniy Lyutikov <eblyuti...@avito.ru>
>> wrote:
>>
>>> Is it safe to roll back the operator version and replace the CRDs with the
>>> old ones?
>>> ------------------------------
>>> *From:* Evgeniy Lyutikov <eblyuti...@avito.ru>
>>> *Sent:* September 11, 2023, 23:50:26
>>> *To:* Gyula Fóra
>>>
>>> *Cc:* user@flink.apache.org
>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>>> from suspend
>>>
>>>
>>> Hi!
>>> No, no one could have restarted the JobManager.
>>> I monitored the pods in real time, and they were all deleted on suspend,
>>> as expected.
>>>
>>>
>>> ------------------------------
>>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>>> *Sent:* September 11, 2023, 20:34:52
>>> *To:* Evgeniy Lyutikov
>>> *Cc:* user@flink.apache.org
>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>>> from suspend
>>>
>>> Hi!
>>>
>>> I could not reproduce your issue; last-state suspend/restore seems to
>>> work as before.
>>> However, these two log lines look very suspicious:
>>>
>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
>>>
>>> Looks like after suspending (and deleting the JobManager Deployment)
>>> somebody restarted the JobManager manually. Is that possible?
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru>
>>> wrote:
>>>
>>>> Hi all!
>>>> After updating the operator to version 1.6.0, suspending and resuming
>>>> Flink jobs stopped working.
>>>> When a job resumes, the high-availability metadata is removed.
>>>>
>>>> Suspend job:
>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.
>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][rec-job/rec-job] Job is in running state, ready for upgrade with LAST_STATE
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUSPENDED       | Suspending existing deployment.
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment while preserving HA metadata.
>>>> 2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
>>>> 2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
>>>> 2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
>>>> 2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
>>>> 2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource (job) has been suspended
>>>> 2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...
>>>>
>>>> Resume:
>>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
>>>> 2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource (job) has been suspended
>>>> 2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][rec-job/rec-job] Deleting deployment with terminated application before new deployment
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
>>>> 2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting Kubernetes HA metadata
>>>> 2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
>>>> 2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
>>>> 2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting deployment
>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
>>>> 2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
>>>> 2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Warning | RESTOREFAILED   | HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>>>> 2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Error   | UPGRADING       | {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{},"throwableList":[]}
>>>> 2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting deployment
>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
>>>> 2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
>>>>
>>>
