Re: Failed to resume from HA when the checkpoint has been deleted.

Zhanghao Chen Mon, 10 Jun 2024 19:57:26 -0700

Hi,

In this case, you could cancel the job with the flink stop command, which 
cleans up the Flink HA metadata, and then resubmit the job.
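
If the job cannot be stopped cleanly, the error message's own suggestion 
("Try cleaning the state handle store") amounts to removing the ZooKeeper 
entry that still points at the missing S3 object. Below is a minimal Python 
sketch for locating that entry from the log lines in the original message. 
It assumes that the logged namespace plus the handle name form the znode 
path; this is an assumption and should be checked (for example with 
zkCli.sh) before deleting anything. The function name and regular 
expressions are illustrative only and are not part of Flink.

```python
import re

# Excerpt of the log from the original message, with wrapped lines joined.
LOG = """\
Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Could not retrieve checkpoint 50417 from state handle under /0000000000000050417.
"""


def broken_handle_znode(log: str) -> str:
    """Return the znode path assumed to hold the broken state handle.

    Assumption: the path is the logged namespace followed by the handle
    name reported in the "Could not retrieve checkpoint" message.
    """
    namespace = re.search(r"namespace='([^']+)'", log)
    handle = re.search(r"from state handle under (/\d+)", log)
    if namespace is None or handle is None:
        raise ValueError("log does not contain the expected lines")
    return "/" + namespace.group(1).strip("/") + handle.group(1)


if __name__ == "__main__":
    # Only prints the candidate path; it does not touch ZooKeeper.
    print(broken_handle_znode(LOG))
```

The script only reports a candidate path. Whether it is safe to remove that 
entry depends on your setup, since the job would then start without that 
checkpoint's state.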

Best,
Zhanghao Chen
________________________________
From: Jean-Marc Paulin <[email protected]>
Sent: Monday, June 10, 2024 18:53
To: [email protected] <[email protected]>
Subject: Failed to resume from HA when the checkpoint has been deleted.

Hi,

We have a Flink 1.19 streaming job with HA enabled (ZooKeeper) and 
checkpoints/savepoints in S3. We had an outage and now the jobmanager keeps 
restarting. We think this is because it reads the ID of the job to restart 
from ZooKeeper, but because we lost our S3 storage as part of the outage, it 
cannot find the checkpoint to restart from, and dies.

```
Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Trying to fetch 1 checkpoints from storage.
Trying to retrieve checkpoint 50417.

exception: JobMaster for job 2512c6153c7ae16fa6da6d64772d75c5 failed.
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store
...
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 50417 from state handle under /0000000000000050417. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 17D7166A4D756355; S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df; Proxy: null), S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df (Path: s3://test/high-availability/flink-job/completedCheckpoint64d901465702)

Fatal error occurred in the cluster entrypoint.
```

Is there an option we can use to configure the job to ignore this error?

Kind regards

JM

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
