[
https://issues.apache.org/jira/browse/ZOOKEEPER-4840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin Wikant updated ZOOKEEPER-4840:
------------------------------------
Description:
h2. TL;DR
After upgrading Apache Zookeeper & Apache Curator, our application is
experiencing new, unexpected behaviour after events such as:
* Zookeeper server(s) being restarted
* Zookeeper server failing & being replaced
h2. Background
The application uses Zookeeper for leader election & metadata storage. It runs
on 3 hosts, each of which also runs 1 Zookeeper server/daemon.
Previously the application ran on Zookeeper version 3.5.10 & Curator version
4.3.0.
After upgrading to Zookeeper version 3.9.1 & Curator version 5.2.0, a new edge
case was observed: after Zookeeper daemons are restarted/failed, the
application (i.e. the Zookeeper client) enters a 15+ minute loop of repeatedly
logging {{SessionExpiredException}}.
These repeated {{SessionExpiredException}} logs are not indicative of a full
Zookeeper client/communication outage, because DEBUG logs show that other
Zookeeper sessions are communicating just fine. Unfortunately, the
{{SessionExpiredException}} logs do not include the Session ID that is
encountering the exception.
h2. Symptoms
When using Zookeeper version 3.9.1 & Curator version 5.2.0, after
restarting/failing some of the Zookeeper daemons:
# All 3 Zookeeper clients experience connection failures lasting a few seconds
after the Zookeeper daemons are failed/restarted.
# These connection failures resolve shortly without any action needed.
# Around 1 minute after the Zookeeper daemons are failed/restarted, all 3
Zookeeper clients start repeatedly logging {{SessionExpiredException}}.
# The {{SessionExpiredException}} is repeatedly logged for 15+ minutes. During
this time there are no connectivity issues: the Zookeeper server logs show that
all 3 Zookeeper servers are receiving regular traffic from the clients.
# Interestingly, each Zookeeper server receives no requests from its local
Zookeeper client for the duration of the period where
{{SessionExpiredException}} is repeatedly logged by the clients. However, each
Zookeeper server does receive regular traffic from the 2 remote Zookeeper
clients.
The evidence suggests that this is a client-side issue & that the
{{SessionExpiredException}} is being thrown before the request is even sent to
the Zookeeper server.
h2. Root Cause Theory
Based on my analysis of the available DEBUG & TRACE level logs, my current
working theory is that the Zookeeper client is repeatedly trying to use a
connection/session in CLOSED state, which causes the client to throw
{{SessionExpiredException}} on every retry.
The first {{SessionExpiredException}} is thrown after a log pattern like the
following:
{quote}2024-06-18 01:14:45,694 WARN Curator-ConnectionStateManager-0: Session
timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed
ms: 60000. Adjusted session timeout ms: 60000
2024-06-18 01:14:45,694 INFO Curator-ConnectionStateManager-0: ZK
getBindInformation:
ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0:
injectSessionExpiration() called
2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0: channel is told
closing
2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is disconnected:
[id: 0x86a592c8, L:/172.31.200.230:35836 !
R:ip-172-31-200-230.ec2.internal/172.31.200.230:22181]
2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is told closing
2024-06-18 01:14:45,698 TRACE
application-SendThread(ip-172-31-200-230.ec2.internal:22181): SendThread exited
loop for session: 0x1000000fc490005
2024-06-18 01:14:45,703 DEBUG application-EventThread: Retry-able exception
received
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /...
at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
at
org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
{quote}
It seems the initial Zookeeper daemon restart/failure causes the Curator
client to call "{_}injectSessionExpiration(){_}", which in turn causes session
"{_}0x1000000fc490005{_}" to have its SendThread exit.
[The injectSessionExpiration method then puts the ClientCnxn object for the
session into CLOSED
state.|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeperTestable.java#L41]
This causes [the following code path to throw {{SessionExpiredException}}
before the request is even sent to the Zookeeper
server.|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L777]
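The client-side behaviour described above can be sketched as follows. This is a minimal, stand-alone paraphrase of the state-to-error mapping in the linked ClientCnxn code path, under my reading of it; the class, enum, and method names here are illustrative and are not ZooKeeper's actual API:

```java
// Hedged sketch (illustrative names, not ZooKeeper's API): when a request
// cannot be sent, the client maps its own connection state to an error code
// locally, before anything reaches the server. A handle stuck in CLOSED
// therefore surfaces SESSIONEXPIRED on every attempt.
public class StateErrorSketch {
    enum State { CONNECTED, AUTH_FAILED, CLOSED, DISCONNECTED }

    // Assumed mapping, mirroring the linked code path: CLOSED yields a
    // session-expired error, auth failure yields AUTHFAILED, and any other
    // non-sendable state yields a (retryable) connection-loss error.
    static String errorFor(State state) {
        switch (state) {
            case AUTH_FAILED: return "AUTHFAILED";
            case CLOSED:      return "SESSIONEXPIRED";
            default:          return "CONNECTIONLOSS";
        }
    }

    public static void main(String[] args) {
        System.out.println(errorFor(State.CLOSED));        // SESSIONEXPIRED
        System.out.println(errorFor(State.DISCONNECTED));  // CONNECTIONLOSS
    }
}
```

This would explain why the errors appear instantly and why the local server sees no traffic: the error is synthesized entirely on the client.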
Then, 25 minutes later, the EventThread for the Zookeeper session is finally
stopped as well:
{quote}2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetDataBuilderImpl-Background - 1163211 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetChildrenBuilderImpl-Background - 1163211 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
CreateBuilderImpl-Background - 1163211 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetChildrenBuilderImpl-Background - 1003198 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetDataBuilderImpl-Background - 383164 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetChildrenBuilderImpl-Background - 383164 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
CreateBuilderImpl-Background - 383164 ms
2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
GetChildrenBuilderImpl-Background - 266157 ms
2024-06-18 01:40:43,942 INFO application-EventThread: EventThread shut down for
session: 0x1000000fc490005
{quote}
Only after the EventThread is stopped do the {{SessionExpiredException}} logs
stop occurring.
This leads me to believe that the lingering EventThread for the closed session
keeps using the closed session, producing the repeated
{{SessionExpiredException}} until the EventThread itself is stopped.
Between "2024-06-18 01:14:45" & "2024-06-18 01:40:43" we see 0 requests from
the local client "{_}172.31.200.230{_}" reaching the Zookeeper server also on
"{_}172.31.200.230{_}". However, other sessions to other Zookeeper servers are
unaffected & operate without connection or session expiry issues.
h2. Additional Notes
* We can see the stack trace from "ZooKeeper.getData" down to "conLossPacket"
here:
**
[ZooKeeper.getData|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L2013]
**
[ClientCnxn.submitRequest|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1516]
**
[ClientCnxn.queuePacket|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1621]
**
[ClientCnxn.conLossPacket|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L777]
* I have validated that the Curator RetryPolicy configuration is not the
problem here. Each {{SessionExpiredException}} is retried 5 times in under 1
minute before being surfaced as a "Watcher exception":
{quote}2024-06-18 01:17:42,200 INFO application-EventThread: ZK
getBindInformation:
ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
2024-06-18 01:17:42,200 DEBUG application-EventThread: Retry-able exception
received
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /...
at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
at
org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
...
2024-06-18 01:18:27,202 TRACE application-EventThread: Counter retries-allowed:
1
2024-06-18 01:18:27,202 DEBUG application-EventThread: Retrying operation
2024-06-18 01:18:27,202 INFO application-EventThread: ZK getBindInformation:
ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
2024-06-18 01:18:27,202 DEBUG application-EventThread: Retry-able exception
received
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /...
...
2024-06-18 01:18:27,202 TRACE application-EventThread: Counter
retries-disallowed: 1
2024-06-18 01:18:27,202 DEBUG application-EventThread: Retry policy not
allowing retry
2024-06-18 01:18:27,202 ERROR application-EventThread: Watcher exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /...
at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
at
org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
at
org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
at
org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
at
org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
{quote}
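The retry behaviour above can be illustrated with a minimal stand-alone simulation. This is not Curator code (the class and method names are invented for illustration); it only shows why a retry loop cannot recover when the underlying handle is terminally CLOSED: every attempt fails identically until the retry budget is exhausted and the exception is surfaced:

```java
// Hedged simulation (illustrative names, not Curator's API): a retry loop
// over an operation that always throws because the session handle it uses
// stays in CLOSED state. Retrying cannot succeed until the handle itself
// is replaced, so the exception is eventually surfaced to the caller.
public class RetryLoopSketch {
    static class SessionExpired extends Exception {}

    interface Op { void run() throws SessionExpired; }

    // Stand-in for a retry loop with an N-retries policy: re-invoke the
    // operation until it succeeds or the policy stops allowing retries.
    static int callWithRetry(Op op, int maxRetries) throws SessionExpired {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                op.run();
                return attempts;
            } catch (SessionExpired e) {
                // Once retries are exhausted, the exception escapes the
                // loop (surfaced in our logs as "Watcher exception").
                if (attempts > maxRetries) throw e;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean surfaced = false;
        try {
            // The handle is CLOSED, so every attempt throws.
            callWithRetry(() -> { calls[0]++; throw new SessionExpired(); }, 5);
        } catch (SessionExpired e) {
            surfaced = true;
        }
        System.out.println(calls[0] + " attempts, surfaced=" + surfaced);
        // 6 attempts, surfaced=true
    }
}
```

This matches the observed pattern: 5 retries of each operation in under a minute, then the "Watcher exception", repeating for every new operation as long as the closed session remains in use.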
> Repeated SessionExpiredException after Zookeeper daemon restart
> ---------------------------------------------------------------
>
> Key: ZOOKEEPER-4840
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4840
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.9.1
> Reporter: Kevin Wikant
> Priority: Major
>
> h2. TL;DR
> After upgrade Apache Zookeeper & Apache Curator, our application is
> experiencing a new unexpected behaviour after events such as:
> * Zookeeper server(s) being restarted
> * Zookeeper server failing & being replaced
> h2. Background
> Application is using Zookeeper for leader election & metadata storage. The
> application runs on 3 hosts which each also have 1 Zookeeper server/daemon
> running.
> Previously the application was running on Zookeeper version 3.5.10 & Curator
> version 4.3.0
> After upgrading to Zookeeper version 3.9.1 & Curator version 5.2.0 a new edge
> case was observed where after Zookeeper daemons are restarted/failed the
> application (i.e. Zookeeper client) enters a 15+ minute loop of repeatedly
> logging "{{{}SessionExpiredException"{}}}
> These repeated "{{{}SessionExpiredException"{}}} are not indicative of a full
> Zookeeper client/communication outage because DEBUG logs show that other
> Zookeeper sessions are communicating just fine. The
> "{{{}SessionExpiredException"{}}} logs unfortunately do not show the
> associated Session ID which is encountering the
> "{{{}SessionExpiredException"{}}}
> h2. Symptoms
> When using Zookeeper version 3.9.1 & Curator version 5.2.0, after
> restarting/failing some of the Zookeeper daemons:
> # All the 3 zookeeper clients experience some connections failures lasting a
> few seconds after the Zookeeper daemons were failed/restarted.
> # These connection failure issues are resolved shortly without any action
> needed.
> # Around 1 minute after the Zookeeper daemons were failed/restarted, all the
> 3 zookeeper clients start repeatedly logging "{{{}SessionExpiredException"{}}}
> # The "{{{}SessionExpiredException"{}}} is repeatedly logged for 15+
> minutes. During this time there are no connectivity issues. We can see from
> the Zookeeper server logs that all 3 Zookeeper servers are receiving regular
> traffic from the clients.
> # Interestingly, each Zookeeper server is not receiving any requests from
> the local Zookeeper client for the duration of the period where
> "{{{}SessionExpiredException"{}}} is repeatedly logged by the clients.
> However, each Zookeeper server is receiving regular traffic from the 2 remote
> Zookeeper clients.
> The evidence suggests that this is a client-side issue & the
> "{{{}SessionExpiredException"{}}} is being thrown before the request is even
> sent to the Zookeeper server.
> h2. Root Cause Theory
> Based on my analysis of the available DEBUG & TRACE level logs, my current
> working theory is that the Zookeeper client is repeatedly trying to use a
> connection/session in CLOSED state which is causing the client to repeatedly
> throw "{{{}SessionExpiredException"{}}} for each retry.
> The first "{{{}SessionExpiredException"{}}} is thrown after a log pattern
> like the following:
> {quote}2024-06-18 01:14:45,694 WARN Curator-ConnectionStateManager-0: Session
> timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed
> ms: 60000. Adjusted session timeout ms: 60000
> 2024-06-18 01:14:45,694 INFO Curator-ConnectionStateManager-0: ZK
> getBindInformation:
> ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
> 2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0:
> injectSessionExpiration() called
> 2024-06-18 01:14:45,695 INFO Curator-ConnectionStateManager-0: channel is
> told closing
> 2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is
> disconnected: [id: 0x86a592c8, L:/172.31.200.230:35836 !
> R:ip-172-31-200-230.ec2.internal/172.31.200.230:22181]
> 2024-06-18 01:14:45,698 INFO epollEventLoopGroup-3-1: channel is told closing
> 2024-06-18 01:14:45,698 TRACE
> application-SendThread(ip-172-31-200-230.ec2.internal:22181): SendThread
> exited loop for session: 0x1000000fc490005
> 2024-06-18 01:14:45,703 DEBUG application-EventThread: Retry-able exception
> received
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /...
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
> at
> org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
> {quote}
> It seems the initial Zookeeper daemon restart/failure is causing the Curator
> client to call {{injectSessionExpiration()}}, which in turn causes the
> SendThread for session {{0x1000000fc490005}} to exit.
> [The injectSessionExpiration method then puts the ClientCnxn object for the
> session into the CLOSED
> state.|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeperTestable.java#L41]
> This causes [the following code path to throw
> {{SessionExpiredException}} before the request is even sent to the
> Zookeeper
> server.|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L777]
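A minimal, hypothetical Java sketch of the failure path described above. The class and method names mirror ZooKeeper's, but this is a simplification for illustration, not the actual ZooKeeper source: once {{injectSessionExpiration()}} marks the connection CLOSED, every subsequent request is failed locally with SESSIONEXPIRED before it ever reaches the wire.

```java
// Hypothetical simplification of the client-side path described above;
// not the actual ZooKeeper source.
enum States { CONNECTED, CLOSED }

class ReplyHeaderSketch {
    int err; // 0 = OK, -112 = KeeperException.Code.SESSIONEXPIRED
}

class ClientCnxnSketch {
    private States state = States.CONNECTED;

    // Mirrors what injectSessionExpiration() effectively does: the session
    // is expired locally, without any round trip to the server.
    void injectSessionExpiration() {
        state = States.CLOSED;
    }

    // Mirrors the queuePacket()/conLossPacket() check: once the connection
    // is CLOSED, the request is answered with SESSIONEXPIRED immediately.
    ReplyHeaderSketch submitRequest() {
        ReplyHeaderSketch r = new ReplyHeaderSketch();
        if (state == States.CLOSED) {
            r.err = -112; // surfaced as SessionExpiredException by the caller
        }
        return r;
    }
}

public class Main {
    public static void main(String[] args) {
        ClientCnxnSketch cnxn = new ClientCnxnSketch();
        cnxn.injectSessionExpiration();
        System.out.println(cnxn.submitRequest().err); // prints -112
    }
}
```

This matches the observed symptom: the server never sees the requests, because the error is generated entirely on the client side.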
> Then 25 minutes later, the EventThread for the Zookeeper session is finally
> also stopped:
> {quote}2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetDataBuilderImpl-Background - 1163211 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetChildrenBuilderImpl-Background - 1163211 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> CreateBuilderImpl-Background - 1163211 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetChildrenBuilderImpl-Background - 1003198 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetDataBuilderImpl-Background - 383164 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetChildrenBuilderImpl-Background - 383164 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> CreateBuilderImpl-Background - 383164 ms
> 2024-06-18 01:40:43,942 TRACE application-EventThread: Trace:
> GetChildrenBuilderImpl-Background - 266157 ms
> 2024-06-18 01:40:43,942 INFO application-EventThread: EventThread shut down
> for session: 0x1000000fc490005
> {quote}
> Only after the EventThread is stopped do the
> {{SessionExpiredException}} errors stop occurring.
> This leads me to believe that the lingering EventThread for the closed
> session keeps using that closed session, producing the repeated
> {{SessionExpiredException}} errors until the EventThread is stopped.
> Between "2024-06-18 01:14:45" & "2024-06-18 01:40:43" we see 0 requests
> from the local client {{172.31.200.230}} reaching the Zookeeper server
> also on {{172.31.200.230}}. However, other sessions to other Zookeeper
> servers are unaffected & operate without connection or session-expiry issues.
> h2. Additional Notes
> * We can see the stack trace from {{ZooKeeper.getData}} down to
> {{conLossPacket}} here:
> **
> [ZooKeeper.getData|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L2013]
> **
> [ClientCnxn.submitRequest|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1516]
> **
> [ClientCnxn.queuePacket|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1621]
> **
> [ClientCnxn.conLossPacket|https://github.com/apache/zookeeper/blob/39973dc5fddf778733e4f0370980629c07a16d67/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L777]
> * I have validated that the Curator RetryPolicy configuration is not the
> problem here. Each {{SessionExpiredException}} is retried 5 times in under
> 1 minute before being surfaced as a "Watcher exception":
> {quote}2024-06-18 01:17:42,200 INFO application-EventThread: ZK
> getBindInformation:
> ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
> 2024-06-18 01:17:42,200 DEBUG application-EventThread: Retry-able exception
> received
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /...
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
> at
> org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
> ...
> 2024-06-18 01:18:27,202 TRACE application-EventThread: Counter
> retries-allowed: 1
> 2024-06-18 01:18:27,202 DEBUG application-EventThread: Retrying operation
> 2024-06-18 01:18:27,202 INFO application-EventThread: ZK getBindInformation:
> ip-172-31-200-85.ec2.internal:22181,ip-172-31-200-230.ec2.internal:22181,ip-172-31-200-39.ec2.internal:22181
> 2024-06-18 01:18:27,202 DEBUG application-EventThread: Retry-able exception
> received
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /...
> ...
> 2024-06-18 01:18:27,202 TRACE application-EventThread: Counter
> retries-disallowed: 1
> 2024-06-18 01:18:27,202 DEBUG application-EventThread: Retry policy not
> allowing retry
> 2024-06-18 01:18:27,202 ERROR application-EventThread: Watcher exception
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /...
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:133)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
> at
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.readValue(ConnectionStateAwareSharedValue.java:253)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue.access$200(ConnectionStateAwareSharedValue.java:26)
> at
> org.apache.curator.framework.recipes.shared.ConnectionStateAwareSharedValue$1.process(ConnectionStateAwareSharedValue.java:45)
> at
> org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:563)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:538)
> {quote}
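The retry behavior seen in the logs above can be sketched as follows. This is a hypothetical, self-contained simplification of Curator's {{RetryLoop.callWithRetry}} semantics (not Curator's actual source): the operation keeps failing with the locally injected session expiry, is retried up to the policy's limit, and is then rethrown to the Watcher.

```java
import java.util.concurrent.Callable;

// Stand-in for org.apache.zookeeper.KeeperException.SessionExpiredException.
class SessionExpiredException extends Exception {}

class RetryLoopSketch {
    // Roughly models RetryLoop.callWithRetry with an N-retries policy:
    // retry up to maxRetries times, then surface the exception.
    static <T> T callWithRetry(Callable<T> op, int maxRetries) throws Exception {
        int retryCount = 0;
        while (true) {
            try {
                return op.call();
            } catch (SessionExpiredException e) {
                if (retryCount++ >= maxRetries) {
                    throw e; // logged as "Watcher exception" in the logs above
                }
                // Curator would sleep per the retry policy here; omitted.
            }
        }
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        try {
            RetryLoopSketch.callWithRetry(() -> {
                attempts[0]++;
                // Session is locally CLOSED, so every attempt fails the same way.
                throw new SessionExpiredException();
            }, 5);
        } catch (SessionExpiredException e) {
            System.out.println("attempts=" + attempts[0]); // prints attempts=6
        }
    }
}
```

Because the session is expired locally and never re-established, every retry is doomed, which is why the retries exhaust quickly and the exception keeps resurfacing for as long as the EventThread lives.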
--
This message was sent by Atlassian Jira
(v8.20.10#820010)