There's no such option yet. However, it might not be a good idea to silently 
ignore the exception and restart from a fresh state, as that would violate 
data integrity. Instead, the job should be marked as terminally failed in this 
case (perhaps after a few retries), leaving users or an external job-monitoring 
system to recover it manually.
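
For ordinary runtime failures, a bounded restart strategy already gives that 
kind of behaviour: once the configured attempts are exhausted, the job goes to 
a terminal FAILED state instead of restarting forever. A minimal configuration 
sketch with illustrative values (older Flink versions use the key 
restart-strategy rather than restart-strategy.type); note that this governs 
restarts after task failures and would not by itself address the HA-recovery 
failure below:

```
# Illustrative values only -- tune the attempts and delay for your setup.
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 30 s
```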

Best,
Zhanghao Chen
________________________________
From: Jean-Marc Paulin <j...@uk.ibm.com>
Sent: Tuesday, June 11, 2024 16:04
To: Zhanghao Chen <zhanghao.c...@outlook.com>; user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: Failed to resume from HA when the checkpoint has been deleted.

Thanks for your reply,

Yes, this is indeed an option, but I was more after a config option to handle 
that scenario. If the HA metadata points to a checkpoint that is obviously not 
present (a 404 error in the S3 case), there is little value in retrying; the HA 
data is obviously worthless in that scenario.

But maybe there isn't any.

Best regards

JM
________________________________
From: Zhanghao Chen <zhanghao.c...@outlook.com>
Sent: Tuesday, June 11, 2024 03:56
To: Jean-Marc Paulin <j...@uk.ibm.com>; user@flink.apache.org 
<user@flink.apache.org>
Subject: [EXTERNAL] Re: Failed to resume from HA when the checkpoint has been 
deleted.

Hi,

In this case, you could cancel the job using the flink stop command, which 
will clean up Flink HA metadata, and resubmit the job.
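
A minimal command-line sketch of that sequence; the job id, savepoint path, and 
jar name below are placeholders, not taken from your setup:

```
# Stop the job with a final savepoint; on reaching a terminal state,
# Flink also cleans up the job's HA metadata. All values are placeholders.
./bin/flink stop --savepointPath s3://my-bucket/savepoints <job-id>

# Resubmit the job afterwards.
./bin/flink run ./my-streaming-job.jar
```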

Best,
Zhanghao Chen
________________________________
From: Jean-Marc Paulin <j...@uk.ibm.com>
Sent: Monday, June 10, 2024 18:53
To: user@flink.apache.org <user@flink.apache.org>
Subject: Failed to resume from HA when the checkpoint has been deleted.

Hi,

We have a 1.19 Flink streaming job with HA enabled (ZooKeeper) and 
checkpoints/savepoints in S3. We had an outage and now the JobManager keeps 
restarting. We think this is because it reads the job id to be restarted from 
ZooKeeper, but since we lost our S3 storage as part of the outage it cannot 
find the checkpoint to restart from, and dies.

```
Found 1 checkpoints in 
ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Trying to fetch 1 checkpoints from storage.
Trying to retrieve checkpoint 50417.

exception: JobMaster for job 2512c6153c7ae16fa6da6d64772d75c5 failed.
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could 
not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: 
java.lang.RuntimeException: 
org.apache.flink.runtime.client.JobExecutionException: Failed to initialize 
high-availability completed checkpoint store
...
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 
50417 from state handle under /0000000000000050417. This indicates that the 
retrieved state handle is broken. Try cleaning the state handle store.
Caused by: 
com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException:
 com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not 
exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request 
ID: 17D7166A4D756355; S3 Extended Request ID: 
fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df; Proxy: null), 
S3 Extended Request ID: 
fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df (Path: 
s3://test/high-availability/flink-job/completedCheckpoint64d901465702)


Fatal error occurred in the cluster entrypoint.
```

Is there an option we can use to configure the job to ignore this error?

Kind regards

JM

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU