Hi, I am running Flink v1.12.2 in standalone mode on Kubernetes, with Kubernetes-native HA enabled.
HA works well when either a jobmanager or a taskmanager pod is lost or crashes. However, when I restart the master node, the jobmanager pod always crashes and restarts. This causes the entire Flink cluster to restart, and most of the taskmanager pods restart as well. I didn't see this issue when using ZooKeeper for HA. I'm not sure whether this is a bug that should be handled or whether there is a workaround.

Below are my Flink settings.

Job-Manager flink-conf.yaml:
----
jobmanager.rpc.address: streakerflink-jobmanager
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: /streaker
high-availability.jobmanager.port: 6123
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
kubernetes.cluster-id: streaker
rest.address: streakerflink-jobmanager
rest.bind-port: 8081
rest.port: 8081
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink/streaker
blob.server.port: 6124
metrics.internal.query-service.port: 6125
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9999
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 5 s
jobmanager.memory.process.size: 1768m
parallelism.default: 1
task.cancellation.timeout: 2000
web.log.path: /opt/flink/log/output.log
jobmanager.web.log.path: /opt/flink/log/output.log
web.submit.enable: false

Task-Manager flink-conf.yaml:
----
jobmanager.rpc.address: streakerflink-jobmanager
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: /streaker
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
kubernetes.cluster-id: streaker
taskmanager.network.bind-policy: ip
taskmanager.data.port: 6121
taskmanager.rpc.port: 6122
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 5 s
taskmanager.memory.task.heap.size: 9728m
taskmanager.memory.framework.off-heap.size: 512m
taskmanager.memory.managed.size: 512m
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.jvm-overhead.max: 3g
taskmanager.memory.jvm-overhead.fraction: 0.035
taskmanager.memory.network.fraction: 0.03
taskmanager.memory.network.max: 3g
taskmanager.numberOfTaskSlots: 1
taskmanager.jvm-exit-on-oom: true
metrics.internal.query-service.port: 6125
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9999
web.log.path: /opt/flink/log/output.log
taskmanager.log.path: /opt/flink/log/output.log
task.cancellation.timeout: 2000

Any help will be appreciated!

Thanks,
Jerome