Hi, I have a query about JobManager HA in Flink 1.15.
We currently run a standalone Flink cluster with a single JobManager and multiple TaskManagers, deployed on a Kubernetes cluster (EKS) in Application Mode with Reactive Mode enabled. The TaskManagers are deployed as a ReplicaSet, and the single JobManager is configured to be highly available using the Kubernetes HA services, with recovery data written to S3:

  high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery

We have also configured the cluster to use the RocksDB state backend, with checkpoints written to S3:

  state.backend: rocksdb
  state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints

To test JobManager HA, we delete the JobManager deployment (to simulate a JobManager crash). In this case Kubernetes (EKS) detects the failure, launches a new JobManager pod, and the application cluster recovers from the last successful checkpoint (Restoring job 000....0000 from Checkpoint 5 @ 167...3692 for 000....0000 located at s3://.../checkpoints/00000...0000/chk-5).

However, if we terminate the underlying node (EC2 instance) on which the JobManager pod is scheduled, the cluster is unable to recover. Kubernetes repeatedly tries to launch a new JobManager, but this time the JobManager cannot find a checkpoint to recover from (No checkpoint found during restore) and eventually goes into CrashLoopBackOff after the maximum number of restart attempts.

My questions: does the JobManager need to be configured to store its state in a local working directory backed by a persistent volume? And are there any pointers on how we can recover the cluster from such node failures or terminations?

Vijay Jammi
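P.S. For completeness, the relevant portion of our flink-conf.yaml looks roughly like the sketch below. The storageDir, state backend, and checkpoint settings are as quoted above; the kubernetes.cluster-id, high-availability, and scheduler-mode lines are shown as we believe they are set, so please treat them as an approximation rather than an exact copy.

  # flink-conf.yaml (sketch; <bucket-name> and <app-name> are placeholders)
  kubernetes.cluster-id: <app-name>          # assumed value; identifies the cluster for Kubernetes HA
  high-availability: kubernetes              # Kubernetes HA services, as described above
  high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery
  state.backend: rocksdb
  state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints
  scheduler-mode: reactive                   # reactive mode, as described above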
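P.P.S. Regarding the local working directory question, this is roughly what we were considering trying, in case it clarifies the question: mounting a persistent volume into the JobManager pod and pointing the working-directory option at it. The option name is taken from the Flink 1.15 docs as we understand them, and the mount path and claim name below are only examples, not our actual setup.

  # flink-conf.yaml addition (sketch)
  process.jobmanager.working-dir: /opt/flink/workdir   # example mount path

  # JobManager pod spec, volume-related parts only (sketch)
  spec:
    containers:
      - name: jobmanager
        volumeMounts:
          - name: jm-workdir
            mountPath: /opt/flink/workdir   # must match the working-dir option above
    volumes:
      - name: jm-workdir
        persistentVolumeClaim:
          claimName: flink-jm-workdir       # example PVC name

Would something along these lines be the recommended way to survive node terminations, or is the recovery expected to work purely from the HA data and checkpoints already in S3?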