Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
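
The log shows the connection to host1:28681 being refused roughly every 70 ms. One way to separate a name-resolution problem from a port that simply has no listener is a plain TCP probe. The following is only a minimal sketch, assuming Python 3 is available on the machine that serves the web UI; `probe` is a hypothetical helper name, not part of Flink.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP endpoint as 'open', 'refused', 'unresolved', or 'error: ...'."""
    try:
        # Resolve first, so a hostname/IP mapping problem is reported separately.
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        return "unresolved"
    try:
        # A successful connect means some process is listening on the port.
        with socket.create_connection((addr, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Timeouts, unreachable networks, and similar failures end up here.
        return f"error: {exc}"
```

Calling `probe("host1", 28681)` with the address from the log would be the obvious first check. A result of "refused" would suggest that the name resolves but the old dispatcher endpoint is gone, while "unresolved" would point toward the hostname mapping.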

Thanks,
Dinesh

On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com> wrote:

> Hi all,
> We have a single-job Flink cluster on YARN, set up with high availability.
> Sometimes, after a job manager failure, the next attempt restarts
> successfully from the current checkpoint.
> But occasionally we get the error below.
>
> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>
> Hadoop version: 2.7.1.2.4.0.0-169
>
> Flink version: flink-1.7.2
>
> ZooKeeper version: 3.4.6-169--1
>
>
> *Below is the Flink configuration*
>
> high-availability: zookeeper
>
> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>
> high-availability.storageDir: hdfs:///flink/ha
>
> high-availability.zookeeper.path.root: /flink
>
> yarn.application-attempts: 10
>
> state.backend: rocksdb
>
> state.checkpoints.dir: hdfs:///flink/checkpoint
>
> state.savepoints.dir: hdfs:///flink/savepoint
>
> jobmanager.execution.failover-strategy: region
>
> restart-strategy: failure-rate
>
> restart-strategy.failure-rate.max-failures-per-interval: 3
>
> restart-strategy.failure-rate.failure-rate-interval: 5 min
>
> restart-strategy.failure-rate.delay: 10 s
>
>
>
> Can someone let me know whether I am missing something, or whether this
> is a known issue?
>
> Could it be related to a hostname-to-IP mapping issue or a ZooKeeper
> version issue?
>
> Thanks,
>
> Dinesh
>
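
For readers less familiar with the failure-rate settings quoted above (3 failures per 5 min), the sliding-window idea behind them can be illustrated as follows. This is a simplified model under my own assumptions, not Flink's implementation; the class and method names are hypothetical.

```python
from collections import deque


class FailureRateModel:
    """Simplified sliding-window model of a failure-rate restart policy."""

    def __init__(self, max_failures: int, interval_s: float):
        self.max_failures = max_failures
        self.interval_s = interval_s
        self.failures = deque()  # timestamps of recent failures, oldest first

    def on_failure(self, now_s: float) -> bool:
        """Record a failure at now_s; return True if a restart is still allowed."""
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now_s - self.failures[0] > self.interval_s:
            self.failures.popleft()
        self.failures.append(now_s)
        # A restart is allowed while the window holds at most max_failures entries.
        return len(self.failures) <= self.max_failures
```

With `FailureRateModel(3, 300)`, failures at 0 s, 60 s, and 120 s each still allow a restart, while a fourth failure at 180 s does not. If the fourth failure instead arrives at 400 s, the two oldest entries have left the window and the restart is allowed again.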
