By "restart master node", do you mean restarting the K8s control-plane
components (e.g. the API server, etcd)?

Even if the master components are restarted, the Flink JobManager and
TaskManagers should eventually recover.
Could you please share the JobManager logs so that we can debug why it
crashed?
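In case it helps, the logs of a crashed pod can usually be pulled with kubectl. A sketch; the pod and namespace names below are placeholders, not your actual resource names:

```shell
# Find the JobManager pod (adjust the namespace and name filter to your deployment)
kubectl get pods -n <namespace> | grep jobmanager

# Logs of the currently running container
kubectl logs <jobmanager-pod-name> -n <namespace>

# Logs of the previous container instance, i.e. the one that crashed
kubectl logs <jobmanager-pod-name> -n <namespace> --previous
```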


Best,
Yang
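P.S. Kubernetes-native HA keeps its leader-election and job metadata in ConfigMaps, so it is also worth checking whether those survived the master restart. A sketch, assuming the `kubernetes.cluster-id: streaker` from your config; the exact ConfigMap names vary by Flink version:

```shell
# HA ConfigMaps are named after the kubernetes.cluster-id ("streaker" here)
kubectl get configmaps | grep streaker

# Inspect the recorded leader information (the ConfigMap name is an example)
kubectl describe configmap streaker-dispatcher-leader
```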

Jerome Li <l...@vmware.com> wrote on Tue, May 25, 2021 at 3:43 AM:

> Hi,
>
>
>
> I am running Flink v1.12.2 in Standalone mode on Kubernetes, with
> Kubernetes-native HA enabled.
>
>
>
> HA works well when either the jobmanager or a taskmanager pod is lost or
> crashes.
>
>
>
> But when I restart a master node, the jobmanager pod always crashes and
> restarts. This causes the entire Flink cluster to restart, and most of the
> taskmanager pods restart as well.
>
>
>
> I didn’t see this issue when using ZooKeeper as the HA backend. I am not
> sure whether this is a bug that should be handled or whether there is a
> workaround.
>
>
>
>
>
> Below are my Flink settings.
>
> Job-Manager
>
> flink-conf.yaml:
>
> ----
> jobmanager.rpc.address: streakerflink-jobmanager
>
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> high-availability.cluster-id: /streaker
> high-availability.jobmanager.port: 6123
> high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
> kubernetes.cluster-id: streaker
>
> rest.address: streakerflink-jobmanager
> rest.bind-port: 8081
> rest.port: 8081
>
> state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink/streaker
>
> blob.server.port: 6124
> metrics.internal.query-service.port: 6125
> metrics.reporters: prom
> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
> metrics.reporter.prom.port: 9999
>
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 2147483647
> restart-strategy.fixed-delay.delay: 5 s
>
> jobmanager.memory.process.size: 1768m
> parallelism.default: 1
> task.cancellation.timeout: 2000
>
> web.log.path: /opt/flink/log/output.log
> jobmanager.web.log.path: /opt/flink/log/output.log
> web.submit.enable: false
>
> Task-Manager
>
> flink-conf.yaml:
>
> ----
> jobmanager.rpc.address: streakerflink-jobmanager
>
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> high-availability.cluster-id: /streaker
> high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
> kubernetes.cluster-id: streaker
>
> taskmanager.network.bind-policy: ip
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
>
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 2147483647
> restart-strategy.fixed-delay.delay: 5 s
>
> taskmanager.memory.task.heap.size: 9728m
> taskmanager.memory.framework.off-heap.size: 512m
> taskmanager.memory.managed.size: 512m
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.jvm-overhead.max: 3g
> taskmanager.memory.jvm-overhead.fraction: 0.035
> taskmanager.memory.network.fraction: 0.03
> taskmanager.memory.network.max: 3g
> taskmanager.numberOfTaskSlots: 1
> taskmanager.jvm-exit-on-oom: true
>
> metrics.internal.query-service.port: 6125
> metrics.reporters: prom
> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
> metrics.reporter.prom.port: 9999
>
> web.log.path: /opt/flink/log/output.log
> taskmanager.log.path: /opt/flink/log/output.log
>
> task.cancellation.timeout: 2000
>
> Any help will be appreciated!
>
>
>
> Thanks,
>
> Jerome
>
