[
https://issues.apache.org/jira/browse/ZOOKEEPER-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946919#comment-17946919
]
Kezhu Wang commented on ZOOKEEPER-4921:
---------------------------------------
[~tranvanchuong1995] Thank you for sharing! You can upload log files
directly to Jira.
3.9.3:
{noformat}
o.a.z.ClientCnxn$ConnectionTimeoutException: Client connection timed out, have
not heard from server in 2667ms for session id 0x100707c2dc261da
{noformat}
3.9.2:
{noformat}
Session 0x100707c2dc26281 for server XXX:2181, Closing socket connection.
Attempting reconnect except it is a SessionExpiredException.
{noformat}
The session will eventually expire in both of your cases. In 3.9.3, it is
*considered expired* by the client, while in 3.9.2, it is *confirmed expired* by
the server. There is not much difference between the two from the client's
perspective, except that *the latter will be delayed substantially if the
cluster is unreachable*, as in this case.
{noformat}
With 3.9.2, it will retry really hard until we connect to the VPN:
{"2025-04-23T19:42:25.543-07:00","Initiating client connection,
connectString=XXX:2181 sessionTimeout=4000 " }
{"2025-04-23T19:42:25.557-07:00", "Socket connection established, initiating
session, client: /10.4.9.178:61917, server: XXX/YYY:2181" }
{"2025-04-23T19:42:25.571-07:00", "Session establishment complete on server
XXX/YYY:2181, session id = 0x100707c2dc26286, negotiated timeout = 4000" }
{noformat}
It will retry hard (*actually endlessly*) to reconnect and to learn of its
expiration. The next client succeeds immediately in your case because the
network has been repaired.
3.9.3:
{noformat}
2025-04-23T18:51:41.595-07:00, Initiating client connection,
connectString=XXX:2181 sessionTimeout=4000
2025-04-23T18:51:41.600-07:00, Opening socket connection to server XXX/YYYY:2181
2025-04-23T18:51:45.613-07:00, Client connection timed out, have not heard from
server in 4005ms for session id 0x0
2025-04-23T18:51:46.730-07:00, Opening socket connection to server XXX/YYYY:2181
2025-04-23T18:51:50.732-07:00, Client session timed out, have not heard from
server in 9134ms for session id 0x0
{noformat}
From the log, I can tell that it tries twice before exhausting the expiration
timeout (a.k.a. 4/3 * 4000ms).
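As a rough check of that arithmetic, here is a small sketch. It assumes only the 4/3 factor quoted above and integer millisecond math; the class and method names are made up for illustration and are not ZooKeeper APIs, and the exact comparison the client uses is an assumption.

```java
final class ExpirationMath {
    // Assumption: expiration timeout is 4/3 of the session timeout, as stated above.
    static long expirationTimeoutMs(long sessionTimeoutMs) {
        return sessionTimeoutMs * 4 / 3;
    }

    // Assumption: the session is considered expired once the silence reaches that limit.
    static boolean consideredExpired(long silentForMs, long sessionTimeoutMs) {
        return silentForMs >= expirationTimeoutMs(sessionTimeoutMs);
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 4000; // sessionTimeout from the log
        System.out.println(expirationTimeoutMs(sessionTimeoutMs));     // 5333
        System.out.println(consideredExpired(4005, sessionTimeoutMs)); // false: first attempt
        System.out.println(consideredExpired(9134, sessionTimeoutMs)); // true: second attempt
    }
}
```

With sessionTimeout=4000 the limit comes out at about 5333ms, so the first timeout at 4005ms is still under it and the second at 9134ms is past it.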
3.9.2:
{noformat}
Session establishment complete on server XXX/YYY:2181, session id =
0x100707c2dc26286, negotiated timeout = 4000
{noformat}
In 3.9.2, the client keeps retrying (*actually endlessly*) until a brand new
session is established, while 3.9.3 does not.
Those endless retries are what ZOOKEEPER-4508 tries to fix, so that the client
gets a prompt notification according to the session timeout and can react to it.
I think you could mimic the 3.9.2 behavior by looping on new session
establishment, so that you get a brand new session after the network is repaired.
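A minimal sketch of such a loop follows, using only the JDK so it runs on its own. The names SessionRetry and connectWithRetry are hypothetical, and the connect callback is a stand-in: in a real application it would presumably create a new org.apache.zookeeper.ZooKeeper handle and wait for the connected state, and the application would call the loop again when its watcher sees the Expired event. That wiring is an assumption and is not shown.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class SessionRetry {
    /**
     * Calls connect until it returns a session, sleeping backoffMs between
     * failed attempts. Gives up after maxAttempts failures.
     */
    static <T> T connectWithRetry(Callable<T> connect, long backoffMs, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Hypothetical hook: build a brand new session here.
                return connect.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // wait before the next attempt
                }
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake connect: fails twice (network down), then succeeds.
        String session = connectWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("network unreachable");
            }
            return "session-" + calls[0];
        }, 10, 5);
        System.out.println(session + " after " + calls[0] + " attempts");
    }
}
```

Passing a very large maxAttempts gets close to the endless retry of 3.9.2, while a bounded value lets the application decide when to give up.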
> Zookeeper Client 3.9.3 Fails to Reconnect After Network Failures
> ----------------------------------------------------------------
>
> Key: ZOOKEEPER-4921
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4921
> Project: ZooKeeper
> Issue Type: Bug
> Components: java client
> Affects Versions: 3.9.3
> Reporter: Chuong Tran
> Priority: Critical
>
> After upgrading the Java Zookeeper client to version 3.9.3, we observed that
> it is not resilient to brief network disruptions, such as a short VPN blip.
> In such cases, the client attempts to reconnect only once, and if
> unsuccessful, the session expires.
> {quote}Apr 23, 2025 10:19:23 AM
> com.twitter.finagle.common.zookeeper.ZooKeeperClient$3 process
> INFO: Zookeeper session expired. Event: WatchedEvent state:Expired type:None
> path:null zxid: -1
> {quote}
> In contrast, the previous version (3.9.2) would continuously retry until the
> network connection was restored, maintaining the session more reliably.
> I believe it's a new issue with this change:
> https://issues.apache.org/jira/browse/ZOOKEEPER-4508
>
> Steps to repro:
> # Open VPN.
> # Start the application which connects to the Zookeeper server with the VPN.
> # Disable VPN for a couple of minutes.
> # Observe the application.
> # Enable the VPN again.
> {quote}3.9.3:
> "message" : "Session 0x0 for server XXX, Closing socket connection.
> Attempting reconnect except it is a SessionExpiredException or
> SessionTimeoutException.",
> "stackTrace" : "o.a.z.ClientCnxn$SessionTimeoutException: Client session
> timed out, have not heard from server in 5590ms for session id 0x0
> at o.a.z.ClientCnxn$SendThread.run(ClientCnxn.java:1253)
> {quote}
> 3.9.2: Application will be reconnected successfully.