Hi,

We have a Flink 1.19 streaming job with HA enabled (ZooKeeper) and checkpoints/savepoints stored in S3. After an outage, the JobManager keeps restarting. We believe this is because it reads the id of the job to restart from ZooKeeper, but since our S3 storage was lost in the same outage it cannot find the referenced checkpoint, and dies:
```
Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Trying to fetch 1 checkpoints from storage.
Trying to retrieve checkpoint 50417.
exception: JobMaster for job 2512c6153c7ae16fa6da6d64772d75c5 failed.
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store
...
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 50417 from state handle under /0000000000000050417. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 17D7166A4D756355; S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df; Proxy: null), S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df (Path: s3://test/high-availability/flink-job/completedCheckpoint64d901465702)
Fatal error occurred in the cluster entrypoint.
```

Is there an option we can use to configure the job to ignore this error? Failing that, the exception suggests cleaning the state handle store; is manually deleting the stale checkpoint node from ZooKeeper, along the lines of the sketch below, the right fix?
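This is a minimal sketch of what we had in mind, using Apache Curator (which Flink's ZooKeeper HA service uses under the hood). The connection string is a placeholder for our quorum, and the node path is assembled from the namespace and checkpoint id in the log above, so please treat it as an untested assumption rather than a verified procedure:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RemoveStaleCheckpointHandle {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; replace with the real ZooKeeper quorum.
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-quorum:2181", new ExponentialBackoffRetry(1000, 3))) {
            client.start();

            // Node path taken from the ZooKeeperStateHandleStore namespace and the
            // checkpoint id in the log above. Deleting it drops the reference to the
            // lost S3 object; since it is the only retained checkpoint, the job would
            // then come up without restored state (or from a savepoint we trigger).
            String stalePath = "/flink/aiops/ir-lifecycle/jobs/"
                    + "2512c6153c7ae16fa6da6d64772d75c5/checkpoints/0000000000000050417";
            client.delete().forPath(stalePath);
        }
    }
}
```

Kind regards,
JM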