Hi,

We have a Flink 1.19 streaming job with HA enabled (ZooKeeper) and checkpoints/savepoints stored in S3. After an outage, the JobManager keeps restarting. We believe this is because it reads the id of the job to restart from ZooKeeper, but since our S3 storage was lost in the same outage it cannot find the referenced checkpoint, and dies:
```
Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Trying to fetch 1 checkpoints from storage.
Trying to retrieve checkpoint 50417.
exception: JobMaster for job 2512c6153c7ae16fa6da6d64772d75c5 failed.
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store
...
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 50417 from state handle under /0000000000000050417. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 17D7166A4D756355; S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df; Proxy: null), S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df (Path: s3://test/high-availability/flink-job/completedCheckpoint64d901465702)
Fatal error occurred in the cluster entrypoint.
```

Is there an option we can use to configure the job to ignore this error? Failing that, the exception suggests cleaning the state handle store; is manually deleting the stale checkpoint node from ZooKeeper, along the lines of the sketch below, the right fix?
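This is a minimal sketch of what we had in mind, using Apache Curator (which Flink's ZooKeeper HA service uses under the hood). The connection string is a placeholder for our quorum, and the node path is assembled from the namespace and checkpoint id in the log above, so please treat it as an untested assumption rather than a verified procedure:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RemoveStaleCheckpointHandle {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; replace with the real ZooKeeper quorum.
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-quorum:2181", new ExponentialBackoffRetry(1000, 3))) {
            client.start();

            // Node path taken from the ZooKeeperStateHandleStore namespace and the
            // checkpoint id in the log above. Deleting it drops the reference to the
            // lost S3 object; since it is the only retained checkpoint, the job would
            // then come up without restored state (or from a savepoint we trigger).
            String stalePath = "/flink/aiops/ir-lifecycle/jobs/"
                    + "2512c6153c7ae16fa6da6d64772d75c5/checkpoints/0000000000000050417";
            client.delete().forPath(stalePath);
        }
    }
}
```

Kind regards,
JM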