Hi Yang,
I am attaching one full jobmanager log for a job that I reran today. This
is a job that tries to read from a savepoint.
The same "leader election ongoing" error message is displayed, and it
stays the same even after 30 minutes. If I leave the job without a yarn
kill, it stays that way forever.
Based on your suggestions so far, I guess it might be some ZooKeeper
problem. If that is the case, what can I look out for in ZooKeeper to
figure out the issue?

Thanks,
Dinesh


On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <danrtsey...@gmail.com> wrote:

> I think your problem is not about the akka timeout. Increasing the timeout
> could help in a heavily loaded cluster, especially when the network is not
> very good. However, that is not your case now.
>
> I am not sure what you mean by "never recovers". Do you mean the "Connection
> refused" logs keep going and there are no other logs? How long does it stay
> in "leader election ongoing"? Usually it takes at most 60s, because once the
> old jobmanager crashes, it will lose the leadership after the ZooKeeper
> session timeout. So if the new jobmanager can never acquire the leadership,
> it may be because of some problem with ZooKeeper.
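>
> One thing you could check is whether a stale leader latch entry is still
> present under Flink's HA root in ZooKeeper. Here is a minimal sketch using
> the Curator client (the quorum and the /flink root are taken from your
> configuration further down the thread; the exact layout of the children is
> an assumption, so just browse whatever is actually there):
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class ListFlinkHaZnodes {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "host1:2181,host2:2181,host3:2181",
>                 new ExponentialBackoffRetry(1000, 3));
>         client.start();
>         // recursively print everything under the configured HA root so you
>         // can see whether the old jobmanager's latch entry is still there
>         print(client, "/flink");
>         client.close();
>     }
>
>     private static void print(CuratorFramework client, String path) throws Exception {
>         System.out.println(path);
>         for (String child : client.getChildren().forPath(path)) {
>             print(client, path.endsWith("/") ? path + child : path + "/" + child);
>         }
>     }
> }
>
> If an entry for the crashed jobmanager never disappears even after the
> session timeout, that points to a ZooKeeper-side problem.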
>
> Maybe you need to share the complete jobmanager logs so that we could know
> what
> is happening in the jobmanager.
>
>
> Best,
> Yang
>
>
> On Tue, Mar 31, 2020 at 3:46 AM, Dinesh J <dineshj...@gmail.com> wrote:
>
>> Hi Yang,
>> Thanks for the clarification and suggestion. But my problem is that
>> recovery never happens, and "leader election ongoing" is the message
>> displayed forever.
>> Do you think increasing akka.ask.timeout and akka.tcp.timeout would help
>> in the case of a heavily loaded cluster, as this issue happens mainly
>> during heavy load on the cluster?
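>>
>> For reference, the settings I am referring to, in flink-conf.yaml form
>> (the values here are just an example, not what we currently use):
>>
>> akka.ask.timeout: 60 s
>> akka.tcp.timeout: 60 s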
>>
>> Best,
>> Dinesh
>>
>> On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Dinesh,
>>>
>>> First, I think the error message you provided is not a problem. It just
>>> indicates that the leader election is still ongoing. When it finishes, the
>>> new leader will start a new dispatcher to provide the webui and rest
>>> service.
>>>
>>> From your jobmanager logs "Connection refused: host1/ipaddress1:28681",
>>> we can tell that the old jobmanager has failed. When the new jobmanager
>>> starts, the old jobmanager still holds the lock on the leader latch, so
>>> Flink tries to connect to it. After a few attempts, once the old
>>> jobmanager's ZooKeeper client no longer renews the leader latch, the new
>>> jobmanager will win the election and become the active leader. This is
>>> just how the leader election works.
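>>>
>>> As far as I know, Flink's ZooKeeper leader election is built on Curator's
>>> LeaderLatch recipe. A minimal, self-contained sketch of that recipe is
>>> below (the connect string, latch path and participant id are illustrative,
>>> not the ones Flink uses internally):
>>>
>>> import org.apache.curator.framework.CuratorFramework;
>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>> import org.apache.curator.framework.recipes.leader.LeaderLatch;
>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>>
>>> public class LeaderLatchSketch {
>>>     public static void main(String[] args) throws Exception {
>>>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>                 "host1:2181,host2:2181,host3:2181",
>>>                 new ExponentialBackoffRetry(1000, 3));
>>>         client.start();
>>>
>>>         // each contender registers under the same latch path
>>>         LeaderLatch latch = new LeaderLatch(client, "/demo/leaderlatch", "jm-2");
>>>         latch.start();
>>>
>>>         // blocks until this participant is granted leadership; a crashed
>>>         // previous leader only releases the latch once its ZooKeeper
>>>         // session times out, which is why a new jobmanager may wait here
>>>         latch.await();
>>>         System.out.println("has leadership: " + latch.hasLeadership());
>>>
>>>         latch.close();
>>>         client.close();
>>>     }
>>> }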
>>>
>>> In a nutshell, the root cause is that the old jobmanager crashed but did
>>> not lose the leadership immediately. This is the by-design behavior.
>>>
>>> If you really want to make the recovery faster, I think you could decrease
>>> "high-availability.zookeeper.client.connection-timeout"
>>> and "high-availability.zookeeper.client.session-timeout". Please keep in
>>> mind that too small a value will also cause unexpected failovers because
>>> of network problems.
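>>>
>>> For example, something like this in flink-conf.yaml (values are in
>>> milliseconds and only illustrative, smaller than the defaults but not too
>>> aggressive):
>>>
>>> high-availability.zookeeper.client.connection-timeout: 10000
>>> high-availability.zookeeper.client.session-timeout: 30000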
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> On Wed, Mar 25, 2020 at 4:20 PM, Dinesh J <dineshj...@gmail.com> wrote:
>>>
>>>> Hi Andrey,
>>>> Yes, the job sometimes does not restart after the current leader failure.
>>>> Below is the message displayed when trying to reach the application master
>>>> URL via the yarn UI, and the message remains the same even if the yarn job
>>>> has been running for 2 days.
>>>> During this time, even the current yarn application attempt does not fail,
>>>> and no containers are launched for the jobmanager and taskmanager.
>>>>
>>>> *{"errors":["Service temporarily unavailable due to an ongoing leader
>>>> election. Please refresh."]}*
>>>>
>>>> Thanks,
>>>> Dinesh
>>>>
>>>> On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <azagre...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Dinesh,
>>>>>
>>>>> If the current leader crashes (e.g. due to network failures), then getting
>>>>> these messages does not look like a problem during the leader re-election.
>>>>> They look to me just like warnings related to the failover.
>>>>>
>>>>> Do you observe any problem with your application? Does the failover
>>>>> not work, e.g. no leader is elected or a job is not restarted after the
>>>>> current leader failure?
>>>>>
>>>>> Best,
>>>>> Andrey
>>>>>
>>>>> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <dineshj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Attaching the job manager log for reference.
>>>>>>
>>>>>> 2020-03-22 11:39:02,693 WARN
>>>>>>  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  
>>>>>> -
>>>>>> Error while retrieving the leader gateway. Retrying to connect to
>>>>>> akka.tcp://flink@host1:28681/user/dispatcher.
>>>>>> 2020-03-22 11:39:02,724 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,791 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,861 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,931 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,001 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,071 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,141 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,211 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,281 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,351 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,421 WARN
>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>> refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>                        - Association with remote system
>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>> [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]]
>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>
>>>>>> Thanks,
>>>>>> Dinesh
>>>>>>
>>>>>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> We have a single-job yarn Flink cluster setup with high availability.
>>>>>>> Sometimes, after a jobmanager failure, the next attempt successfully
>>>>>>> restarts from the current checkpoint. But occasionally we get the
>>>>>>> error below.
>>>>>>>
>>>>>>> {"errors":["Service temporarily unavailable due to an ongoing leader 
>>>>>>> election. Please refresh."]}
>>>>>>>
>>>>>>> Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
>>>>>>>
>>>>>>> Flink version: flink-1.7.2
>>>>>>>
>>>>>>> Zookeeper version: 3.4.6-169--1
>>>>>>>
>>>>>>>
>>>>>>> *Below is the flink configuration*
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>>
>>>>>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>>>>>
>>>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>>>>
>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>>
>>>>>>> yarn.application-attempts: 10
>>>>>>>
>>>>>>> state.backend: rocksdb
>>>>>>>
>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>>>>>
>>>>>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>>>>>
>>>>>>> jobmanager.execution.failover-strategy: region
>>>>>>>
>>>>>>> restart-strategy: failure-rate
>>>>>>>
>>>>>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>>>>>
>>>>>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>>>>>
>>>>>>> restart-strategy.failure-rate.delay: 10 s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Can someone let me know if I am missing something, or is this a known
>>>>>>> issue?
>>>>>>>
>>>>>>> Is it related to a hostname/IP mapping issue or a ZooKeeper version
>>>>>>> issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dinesh
>>>>>>>

Attachment: full_log_failed_container.log
Description: Binary data
