Adrian Vasiliu created FLINK-25098:
--------------------------------------

             Summary: Jobmanager CrashLoopBackOff in HA configuration
                 Key: FLINK-25098
                 URL: https://issues.apache.org/jira/browse/FLINK-25098
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.13.3, 1.13.2
         Environment: Reproduced with:
* Persistent jobs storage provided by the rocks-cephfs storage class.
* OpenShift 4.9.5.
            Reporter: Adrian Vasiliu

In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), enabling Flink HA by running 3 replicas of the jobmanager leads to CrashLoopBackOff for all replicas.

Attaching the full logs of the `jobmanager` and `tls-proxy` containers of the jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Remarks:
* This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
* Picked Critical severity as HA is critical for our product.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)