Hi all,
We have a single-job Flink cluster running on YARN with high availability enabled.
Sometimes, after a JobManager failure, the next application attempt restarts
successfully from the latest checkpoint.
But occasionally we get the following error:

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
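As the message itself says, the REST endpoint returns this while no leader has been elected. Below is a minimal sketch for telling that transient state apart from other errors. It assumes the JobManager REST address is known and polls Flink's `/overview` endpoint. The `base_url` value and the `is_leader_election_pending` / `wait_for_leader` helpers are hypothetical and not part of Flink:

```python
import json
import time
import urllib.error
import urllib.request

# Prefix of the message returned while no leader has been elected yet.
LEADER_ELECTION_MSG = "Service temporarily unavailable due to an ongoing leader election"


def is_leader_election_pending(body: str) -> bool:
    """Return True if a REST error body reports an ongoing leader election."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    errors = payload.get("errors", []) if isinstance(payload, dict) else []
    return any(LEADER_ELECTION_MSG in str(e) for e in errors)


def wait_for_leader(base_url: str, attempts: int = 30, delay_s: float = 5.0) -> bool:
    """Poll <base_url>/overview until a leader answers or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/overview", timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as err:
            body = err.read().decode("utf-8", errors="replace")
            if not is_leader_election_pending(body):
                raise  # a different error: surface it instead of retrying
        except urllib.error.URLError:
            pass  # JobManager not reachable yet (e.g. new YARN attempt starting)
        time.sleep(delay_s)
    return False
```

If `wait_for_leader` keeps returning False long after the new attempt has started, the election is stuck rather than merely slow.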

Hadoop version: 2.7.1.2.4.0.0-169

Flink version: 1.7.2

ZooKeeper version: 3.4.6-169--1


*Below is the Flink configuration:*

high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s
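To make the restart settings above concrete, here is a simplified model of the failure-rate strategy, not Flink's own code. With 3 failures per 5 min, up to 3 failures inside any 5-minute window are restarted, and a 4th within that window fails the job. The duration parser only covers the forms used above ("5 min", "10 s"). The 10 s delay is not modeled. This strategy governs task failures; JobManager failures are retried by YARN up to `yarn.application-attempts`. The helper names are hypothetical:

```python
import re
from collections import deque

# Simplified duration parser covering only a few unit suffixes (ms, s, min, h).
_UNITS_S = {"ms": 0.001, "s": 1, "min": 60, "h": 3600}


def parse_duration_s(text: str) -> float:
    """Convert a duration such as '5 min' or '10 s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min|h)\s*", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS_S[unit]


def job_survives(failure_times_s, max_failures: int, interval_s: float) -> bool:
    """Return False as soon as a failure would be the (max_failures + 1)-th
    failure inside a sliding window of interval_s seconds."""
    window = deque()
    for t in sorted(failure_times_s):
        # Drop earlier failures that have fallen out of the sliding window.
        while window and t - window[0] > interval_s:
            window.popleft()
        if len(window) >= max_failures:
            return False  # rate exceeded: no further restart
        window.append(t)
    return True
```

For example, `job_survives([0, 60, 120, 180], 3, parse_duration_s("5 min"))` is False, while spreading the fourth failure out to t = 400 s keeps the job running.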



Can someone let me know whether I am missing something, or whether this is a known issue?

Could it be related to a hostname/IP mapping issue or to the ZooKeeper version?
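One way to rule out a hostname/IP mapping problem is to check each quorum host from every node that may host the JobManager. Each host should resolve and should answer ZooKeeper's `ruok` command. Below is a rough diagnostic sketch. It assumes the four-letter-word commands are enabled on the ZooKeeper servers and that quorum entries are in `host:port` form. `parse_quorum` and `check_server` are hypothetical helpers:

```python
import socket


def parse_quorum(quorum: str):
    """Split 'host1:2181,host2:2181' into [(host, port), ...] (expects host:port)."""
    servers = []
    for entry in quorum.split(","):
        host, _, port = entry.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def check_server(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Resolve host, send 'ruok', and describe what happened."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror as err:
        return f"{host}: DNS lookup failed ({err})"
    try:
        with socket.create_connection((ip, port), timeout=timeout_s) as sock:
            sock.sendall(b"ruok")
            reply = sock.recv(16)  # a healthy server answers b"imok"
    except OSError as err:
        return f"{host} ({ip}): connection failed ({err})"
    return f"{host} ({ip}): {reply.decode(errors='replace') or 'no reply'}"
```

To run the check, call `check_server(host, port)` for each pair returned by `parse_quorum("host1:2181,host2:2181,host3:2181")`. A host that resolves to different IPs on different nodes would point towards a mapping issue.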

Thanks,

Dinesh
