Hi Dongwon!

This error mostly occurs when using Flink 1.14 and the Flink cluster goes
into a terminal state. If a Flink job is FAILED/FINISHED (for example
because it has exhausted its restart strategy), in Flink 1.14 the cluster
shuts itself down and removes the HA metadata.
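
Whether the job ever reaches that terminal FAILED state is governed by the
restart strategy in your Flink configuration, so it is worth checking how
yours is set up. A rough sketch (key names from the standard Flink config,
values just illustrative):

    # flink-conf.yaml / spec.flinkConfiguration (illustrative values)
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10   # job goes to terminal FAILED once these are exhausted
    restart-strategy.fixed-delay.delay: 10 s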

In these cases the operator only sees that the cluster has completely
disappeared and that there is no HA metadata left, so it throws the error
you mentioned. It does not know what happened and has no way to recover
the checkpoint information.
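
The metadata in question is what the Kubernetes HA services store in the HA
configmaps and the HA storage directory, i.e. this only applies if you run
with HA enabled, configured roughly like this (the storage path is a
placeholder):

    kubernetes.cluster-id: flinktest   # the operator normally derives this from the resource name
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/flink/ha   # placeholder path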

This is fixed in Flink 1.15: even after a terminal FAILED/FINISHED state,
the jobmanager does not shut down. This allows the operator to observe the
terminal state and actually recover the job even if the HA metadata was
removed.

To summarize, this is mostly caused by Flink 1.14 behaviour that the
operator cannot control. Upgrading to 1.15 makes this process much more
robust and should eliminate most of these cases.
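
For reference, a minimal last-state deployment against 1.15 would look
roughly like this (image tag, paths and resource numbers are placeholders):

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: flinktest
    spec:
      image: flink:1.15            # placeholder image tag
      flinkVersion: v1_15
      flinkConfiguration:
        high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
        high-availability.storageDir: s3://my-bucket/flink/ha       # placeholder path
        state.checkpoints.dir: s3://my-bucket/flink/checkpoints     # placeholder path
      serviceAccount: flink
      jobManager:
        resource:
          memory: "2048m"
          cpu: 1
      taskManager:
        resource:
          memory: "2048m"
          cpu: 1
      job:
        jarURI: local:///opt/flink/usrlib/flinktest.jar   # placeholder jar path
        parallelism: 2
        upgradeMode: last-state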

Cheers,
Gyula

On Tue, Nov 22, 2022 at 9:43 AM Dongwon Kim <eastcirc...@gmail.com> wrote:

> Hi,
>
> While using the last-state upgrade mode on flink-k8s-operator-1.2.0 and
> flink-1.14.3, we're occasionally facing the following error:
>
> Status:
>>   Cluster Info:
>>     Flink - Revision:             98997ea @ 2022-01-08T23:23:54+01:00
>>     Flink - Version:              1.14.3
>>   Error:                          HA metadata not available to restore
>> from last state. It is possible that the job has finished or terminally
>> failed, or the configmaps have been deleted. Manual restore required.
>>   Job Manager Deployment Status:  ERROR
>>   Job Status:
>>     Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
>>     Job Name:  flinktest
>>     Savepoint Info:
>>       Last Periodic Savepoint Timestamp:  0
>>       Savepoint History:
>>       Trigger Id:
>>       Trigger Timestamp:  0
>>       Trigger Type:       UNKNOWN
>>     Start Time:           1668660381400
>>     State:                RECONCILING
>>     Update Time:          1668994910151
>>   Reconciliation Status:
>>     Last Reconciled Spec:  ...
>>     Reconciliation Timestamp:  1668660371853
>>     State:                     DEPLOYED
>>   Task Manager:
>>     Label Selector:  component=taskmanager,app=flinktest
>>     Replicas:        1
>> Events:
>>   Type     Reason            Age                 From
>> Message
>>   ----     ------            ----                ----
>> -------
>>   Normal   JobStatusChanged  30m                 Job
>> Job status changed from RUNNING to RESTARTING
>>   Normal   JobStatusChanged  29m                 Job
>> Job status changed from RESTARTING to CREATED
>>   Normal   JobStatusChanged  28m                 Job
>> Job status changed from CREATED to RESTARTING
>>   Warning  Missing           26m                 JobManagerDeployment
>> Missing JobManager deployment
>>   Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment
>> HA metadata not available to restore from last state. It is possible that
>> the job has finished or terminally failed, or the configmaps have been
>> deleted. Manual restore required.
>>   Normal   Submit            9s (x106 over 26m)  JobManagerDeployment
>> Starting deployment
>
>
> We're happy with the last-state mode most of the time, but we run into
> this error occasionally.
>
> We found that it's not easy to reproduce the problem; we tried killing JMs
> and TMs and even shutting down the nodes on which JMs and TMs are running.
>
> We also checked that the file size is not zero.
>
> Thanks,
>
> Dongwon
>
>
>
