[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946772#comment-17946772
 ] 

Christopher Tubbs commented on ZOOKEEPER-4921:
----------------------------------------------

The Accumulo team has noticed this and has developed a solution: a "ZooSession"
object that mostly mimics the API of the ZooKeeper client object and
automatically reconnects. However, it needs additional work to be fully
resilient to exceptions thrown due to transient failures. Some of that
resilience is encapsulated in the helper "ZooReader" and "ZooReaderWriter"
objects, which retry operations on transient errors, and we plan to merge those
into the ZooSession feature. I think if we can make it resilient enough to be
generally useful, it is something that we could contribute directly to the
ZooKeeper project. However, that requires some time to polish up the API and
behavior, and we have a lot of other development tasks we're working on.

But I think the idea is sound. It's very weird, after all, that the ZK client
is useless after network failures. It's a bit like having to restart your
browser every time you navigate to a website with an error. That shouldn't be
necessary.

You may want to take a look at the workarounds that the Accumulo project has 
come up with for handling these transient issues.
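
For illustration only, the general shape of such a workaround might look like
the sketch below. It is not Accumulo's ZooSession, ZooReader, or
ZooReaderWriter code, and it does not use any ZooKeeper API: the class
ReconnectingSession, its nested exception types, and the Operation interface
are all hypothetical names invented for this example. With the real client,
the two stand-in exceptions would roughly correspond to
KeeperException.ConnectionLossException (retry on the same handle) and
KeeperException.SessionExpiredException (the handle must be replaced).

```java
import java.util.function.Supplier;

/** Hypothetical sketch of a self-healing session wrapper; not a real ZooKeeper or Accumulo class. */
final class ReconnectingSession<C> {

    /** Stand-in for a transient failure (for example, connection loss). */
    static final class TransientException extends RuntimeException {
        TransientException(String msg) { super(msg); }
    }

    /** Stand-in for session expiry, after which the client handle is unusable. */
    static final class SessionExpiredException extends RuntimeException {
        SessionExpiredException(String msg) { super(msg); }
    }

    /** An operation to run against the current client handle. */
    interface Operation<C, T> {
        T apply(C client) throws Exception;
    }

    private final Supplier<C> factory;   // creates a fresh client handle
    private final int maxAttempts;       // total attempts per call, at least 1
    private final long backoffMillis;    // base delay, scaled by the attempt number
    private C client;                    // current handle, or null if one must be created

    ReconnectingSession(Supplier<C> factory, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
    }

    /** Runs op, retrying transient errors and replacing the client after expiry. */
    synchronized <T> T call(Operation<C, T> op) throws Exception {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (client == null) {
                    client = factory.get();
                }
                return op.apply(client);
            } catch (SessionExpiredException e) {
                client = null; // expired handle cannot be reused; rebuild on the next attempt
                last = e;
            } catch (TransientException e) {
                last = e;      // the same handle may recover; just retry
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffMillis * attempt); // simple linear backoff
            }
        }
        throw last; // all attempts failed; surface the most recent failure
    }
}
```

A caller would construct it with a factory that builds a new client, for
example new ReconnectingSession<>(factory, 5, 250L), and route every read or
write through call(...). Exceptions other than the two stand-ins propagate
immediately, on the assumption that they are not transient.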

> Zookeeper Client 3.9.3 Fails to Reconnect After Network Failures
> ----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4921
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4921
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.9.3
>            Reporter: Chuong Tran
>            Priority: Critical
>
> After upgrading the Java Zookeeper client to version 3.9.3, we observed that 
> it is not resilient to brief network disruptions, such as a short VPN blip. 
> In such cases, the client attempts to reconnect only once, and if 
> unsuccessful, the session expires.
> {quote}Apr 23, 2025 10:19:23 AM 
> com.twitter.finagle.common.zookeeper.ZooKeeperClient$3 process
> INFO: Zookeeper session expired. Event: WatchedEvent state:Expired type:None 
> path:null zxid: -1
> {quote}
> In contrast, the previous version (3.9.2) would continuously retry until the 
> network connection was restored, maintaining the session more reliably.
> I believe this is a new issue introduced by this change: 
> https://issues.apache.org/jira/browse/ZOOKEEPER-4508
>  
> Steps to reproduce:
>  # Open VPN.
>  # Start the application which connects to the Zookeeper server with the VPN.
>  # Disable VPN for a couple of minutes.
>  # Observe the application.
>  # Enable the VPN again.
> {quote}3.9.3:
> "message" : "Session 0x0 for server XXX, Closing socket connection. 
> Attempting reconnect except it is a SessionExpiredException or 
> SessionTimeoutException.",
>   "stackTrace" : "o.a.z.ClientCnxn$SessionTimeoutException: Client session 
> timed out, have not heard from server in 5590ms for session id 0x0
>         at o.a.z.ClientCnxn$SendThread.run(ClientCnxn.java:1253)
> {quote}
> 3.9.2: Application will be reconnected successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
