[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated ZOOKEEPER-4840:
------------------------------------
    Description: 
h2. TL;DR

After upgrading Apache Zookeeper & Apache Curator, our application is 
experiencing a new, unexpected behaviour after events such as:
 * Zookeeper server(s) being restarted
 * Zookeeper server failing & being replaced

h2. Background

The application uses Zookeeper for leader election & metadata storage. It runs 
on 3 hosts, each of which also runs 1 Zookeeper server/daemon.

Previously the application was running on Zookeeper version 3.5.10 & Curator 
version 4.3.0.
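
For reference, the Curator client is constructed roughly as follows. This is a 
minimal sketch: the connect string & the 60000 ms session timeout are taken from 
the logs quoted further below, while the retry policy shown is illustrative 
rather than our exact configuration.
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class StateStoreClientFactory {
    private StateStoreClientFactory() {}

    public static CuratorFramework newClient() {
        // 3 application hosts, each also running a local Zookeeper server on port 22181
        String connectString = "ip-172-31-200-85.ec2.internal:22181,"
                + "ip-172-31-200-230.ec2.internal:22181,"
                + "ip-172-31-200-39.ec2.internal:22181";

        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString(connectString)
                .sessionTimeoutMs(60000)                           // matches "Adjusted session timeout ms: 60000" in the logs
                .retryPolicy(new ExponentialBackoffRetry(1000, 3)) // illustrative; not necessarily our policy
                .build();
        client.start();
        return client;
    }
}
{code}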

After upgrading to Zookeeper version 3.9.1 & Curator version 5.2.0, a new edge 
case was observed: after Zookeeper daemons are restarted/failed, the application 
(i.e. the Zookeeper client) enters a 15+ minute loop of repeatedly logging 
{{SessionExpiredException}}.

These repeated {{SessionExpiredException}} errors are not indicative of a full 
Zookeeper client/communication outage, because DEBUG logs show that other 
Zookeeper sessions are communicating just fine. Unfortunately, the 
{{SessionExpiredException}} log lines do not show the Session ID of the session 
encountering the exception.
h2. Symptoms

When using Zookeeper version 3.9.1 & Curator version 5.2.0, after 
restarting/failing some of the Zookeeper daemons:
 # All 3 Zookeeper clients experience connection failures lasting a few seconds 
after the Zookeeper daemons are failed/restarted.
 # These connection failures are resolved shortly afterwards without any action 
needed.
 # Around 1 minute after the Zookeeper daemons were failed/restarted, all 3 
Zookeeper clients start repeatedly logging {{SessionExpiredException}}.
 # The {{SessionExpiredException}} is repeatedly logged for 15+ minutes. During 
this time there are no connectivity issues; we can see from the Zookeeper server 
logs that all 3 Zookeeper servers are receiving regular traffic from the clients.
 # Interestingly, each Zookeeper server receives no requests from its local 
Zookeeper client for the duration of the period where 
{{SessionExpiredException}} is repeatedly logged by the clients. However, each 
Zookeeper server is receiving regular traffic from the 2 remote Zookeeper 
clients.

The evidence suggests that this is a client-side issue & that the 
{{SessionExpiredException}} is being thrown before the request is even sent to 
the Zookeeper server.
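
Since the {{SessionExpiredException}} log lines do not include a Session ID, a 
connection state listener along the following lines can be used to log Curator's 
state transitions together with the underlying session ID and narrow down which 
session is affected. This is a minimal diagnostic sketch, not the application's 
actual code:
{code:java}
import org.apache.curator.framework.CuratorFramework;

public final class SessionDiagnostics {
    private SessionDiagnostics() {}

    // Logs Curator connection state transitions (SUSPENDED, LOST, RECONNECTED, ...)
    // together with the underlying ZooKeeper session ID, so repeated
    // SessionExpiredExceptions can be correlated with a specific session.
    public static void logStateTransitions(CuratorFramework client) {
        client.getConnectionStateListenable().addListener((c, newState) -> {
            long sessionId = -1L;
            try {
                // getZooKeeper() throws if the handle has not been created yet
                sessionId = c.getZookeeperClient().getZooKeeper().getSessionId();
            } catch (Exception e) {
                // session ID simply not available at this point
            }
            System.out.printf("Curator state: %s, session: 0x%x%n", newState, sessionId);
        });
    }
}
{code}
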
h2. Root Cause Theory

Based on my analysis of the available DEBUG & TRACE level logs, my current 
working theory is that the Zookeeper client is repeatedly trying to use a 
connection in the CLOSED state, which causes the client to throw 
{{SessionExpiredException}} on every retry.

The first {{SessionExpiredException}} is thrown after a log pattern like 
the following:
{quote}2024-06-18 01:14:45,694 WARN Curator-ConnectionStateManager-0: Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 60000. Adjusted session timeout ms: 60000
2024-06-18 01:14:45,694 INFO Curator-ConnectionStateManager-0: ZK getBindInformation: ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0: injectSessionExpiration() called
2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0: channel is told closing
2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is disconnected: [id: 0x86a592c8, L:/172.31.200.230:35836 ! R:ip-172-31-200-230.ec2.internal/172.31.200.230:22181]
2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is told closing
2024-06-18 01:14:45,698 TRACE StateStore-bg-thread-SendThread(ip-172-31-200-230.ec2.internal:22181): SendThread exited loop for session: 0x1000000fc490005

2024-06-18 01:14:45,703 DEBUG StateStore-bg-thread-EventThread: Retry-able exception received
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /EMR/StateStore/ApplicationsAttempt
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
    at org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
    at org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
    at org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
    at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
{quote}
It seems the initial Zookeeper daemon restart/failure causes the Curator client 
to call {{injectSessionExpiration()}}, which in turn causes session 
{{0x1000000fc490005}} to have its SendThread exit.

 

> Repeated SessionExpiredException after Zookeeper daemon restart
> ---------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4840
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4840
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.9.1
>            Reporter: Kevin Wikant
>            Priority: Major
>