Hi,

We have a query about JobManager HA on Flink 1.15.

We currently run a standalone Flink cluster with a single JobManager and
multiple TaskManagers, deployed on top of a Kubernetes cluster (EKS) in
application mode with reactive mode enabled.

The TaskManagers are deployed as a ReplicaSet, and the single JobManager
is configured to be highly available using the Kubernetes HA services, with
recovery data written to S3:
      high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery
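
For completeness, besides the storageDir above, the Kubernetes HA setup in
our flink-conf.yaml amounts to roughly the following two lines (the
cluster-id value shown here is a placeholder):

      high-availability: kubernetes
      kubernetes.cluster-id: <app-name>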

We have also configured the cluster to use the RocksDB state backend, with
checkpoints written to S3:
      state.backend: rocksdb
      state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints

Now, to test the JobManager HA, we first delete the JobManager deployment
(to simulate a JobManager crash). In this case Kubernetes (EKS) detects the
failure, launches a new JobManager pod, and the new JobManager is able to
recover the application cluster from the last successful checkpoint
(Restoring job 000....0000 from Checkpoint 5 @ 167...3692 for 000....0000
located at s3://.../checkpoints/00000...0000/chk-5).
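
Concretely, that first test is just a plain kubectl delete against the
JobManager deployment, roughly as follows (namespace and deployment names
are placeholders for our actual ones):

      kubectl -n <namespace> delete deployment <app-name>-jobmanager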

However, if we instead terminate the underlying node (EC2 instance) on
which the JobManager pod is scheduled, the cluster is unable to recover.
What we see is that Kubernetes repeatedly tries to launch a new JobManager,
but this time the JobManager cannot find a checkpoint to recover from (No
checkpoint found during restore), and the pod eventually ends up in
CrashLoopBackOff after the maximum number of restart attempts.
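
For that second test we terminate the instance from the AWS side, roughly
as follows (the instance id is of course a placeholder):

      aws ec2 terminate-instances --instance-ids <instance-id-of-jobmanager-node>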

So the question is: does the JobManager need to be configured to store its
state in a local working directory backed by a persistent volume? Any
pointers on how we can recover the cluster from such node failures or
terminations would be much appreciated.
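
For reference, what we had in mind (but have not tried yet) is something
along the lines of the sketch below, using what we understand to be the
working directory options introduced in 1.15 and backing that directory
with a PersistentVolumeClaim mounted into the JobManager pod. The path,
volume name and claim name here are made up for illustration:

      # flink-conf.yaml
      process.jobmanager.working-dir: /opt/flink/workdir

      # JobManager pod spec (excerpt)
      volumeMounts:
        - name: jm-workdir
          mountPath: /opt/flink/workdir
      volumes:
        - name: jm-workdir
          persistentVolumeClaim:
            claimName: flink-jm-workdir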

Vijay Jammi
