Patrik Ivarsson created ZOOKEEPER-4893:
------------------------------------------
Summary: Excessive reconnection delays due to hardcoded sleep
intervals
Key: ZOOKEEPER-4893
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4893
Project: ZooKeeper
Issue Type: Improvement
Components: java client
Affects Versions: 3.9.3
Reporter: Patrik Ivarsson
*Description*
I'll try to explain our issue as clearly as I can.
Some clients take too long to reconnect to a ZooKeeper cluster after a brief
downtime. We have identified two hardcoded sleep intervals in the client
connection logic that contribute to this issue, and neither can be configured.
Spending several seconds in this disconnected state, even though the cluster is
up and healthy, is a problem in our setup.
*These are the two Thread.sleep() calls I am referring to*
1. Random sleep (0-1000ms) before attempting a new connection:
* [ClientCnxn.java#L1138
(release-3.9.3)|https://github.com/apache/zookeeper/blob/release-3.9.3/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1138]
2. Fixed 1000ms sleep before reconnecting to the last known server:
* [StaticHostProvider#L363
(release-3.9.3)|https://github.com/apache/zookeeper/blob/release-3.9.3/zookeeper-server/src/main/java/org/apache/zookeeper/client/StaticHostProvider.java#L362]
*Example Scenario*
Consider a three-node ZooKeeper cluster (node01, node02, node03) where node01
is currently the leader.
1. Event: node01 is temporarily taken down for short maintenance (e.g. for
security patching).
2. Result: The remaining nodes (node02 and node03) elect a new leader,
completing within ~1000ms.
3. Behavior of a client that was connected to node01 (worst-case scenario):
* Connection to node01 is lost → client enters a suspended state.
* Waits 500ms, attempts connection to node02 → fails (cluster not ready).
* Waits 499ms, attempts connection to node03 → fails (cluster still not
ready).
* Waits 1000ms because the host list has wrapped around to the original node01
(sleep #2 in the list above).
* Waits 1000ms before connecting to node01 → fails (this node is down for
maintenance).
* Waits 1000ms before retrying node02 → finally succeeds.
4. Total reconnection time: ~4 seconds, despite the cluster being available
after just 1 second.
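The arithmetic behind the ~4 second figure can be checked with a small model. This is only a sketch of the scenario above, not ZooKeeper code: the class and record names are made up for illustration, and the wait values are copied from the worst-case steps listed in this report.

```java
import java.util.List;

/** Simplified model of the worst-case timeline above; not ZooKeeper code. */
final class ReconnectTimeline {

    /** One reconnect attempt: the total wait that precedes it and its outcome. */
    record Attempt(String target, long waitMs, boolean succeeds) {}

    /** The worst-case sequence from the scenario, in order. */
    static List<Attempt> worstCase() {
        return List.of(
            // Random sleep, cluster not ready yet.
            new Attempt("node02", 500, false),
            // Random sleep, cluster still not ready.
            new Attempt("node03", 499, false),
            // Fixed 1000 ms after wrapping around the host list, plus the
            // 1000 ms wait before connecting; node01 is down.
            new Attempt("node01", 1000 + 1000, false),
            // Wait before retrying node02, which now succeeds.
            new Attempt("node02", 1000, true));
    }

    /** Sums the waits up to and including the first successful attempt. */
    static long totalDelayMs(List<Attempt> attempts) {
        long total = 0;
        for (Attempt attempt : attempts) {
            total += attempt.waitMs();
            if (attempt.succeeds()) {
                break;
            }
        }
        return total;
    }
}
```

Calling ReconnectTimeline.totalDelayMs(ReconnectTimeline.worstCase()) adds up 500 + 499 + 2000 + 1000 ms, which is the total quoted in step 4.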
*Impact*
* Clients remain in a suspended state longer than necessary, leading to
degraded service availability.
* The reconnection delay is artificially inflated due to hardcoded sleeps.
*Suggested improvement*
* Give users a way to supply their own logic for how long the client sleeps
before a retry. Making these sleep intervals configurable would help, but being
able to plug in a custom implementation of the waiting logic would be even
better.
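To make the suggestion more concrete, here is a rough sketch of what a pluggable waiting strategy could look like. Everything in it is hypothetical: ReconnectDelayStrategy, LegacyDelay and CappedBackoff are names invented for this example and are not part of the ZooKeeper API, and how such a strategy would be wired into the client is left open.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical extension point; not an existing ZooKeeper interface. */
interface ReconnectDelayStrategy {
    /**
     * @param attempt       zero-based count of consecutive failed attempts
     * @param wrappedAround true when the host list has been cycled through
     * @return milliseconds to wait before the next connection attempt
     */
    long delayMs(int attempt, boolean wrappedAround);
}

/** Approximates the behavior described in this report (assumed, simplified). */
final class LegacyDelay implements ReconnectDelayStrategy {
    @Override
    public long delayMs(int attempt, boolean wrappedAround) {
        // Random wait below 1000 ms before every attempt.
        long jitter = ThreadLocalRandom.current().nextLong(1000);
        // Extra fixed 1000 ms once the host list has wrapped around.
        return wrappedAround ? jitter + 1000 : jitter;
    }
}

/** Example of a user-supplied alternative: capped exponential backoff. */
final class CappedBackoff implements ReconnectDelayStrategy {
    private final long baseMs;
    private final long maxMs;

    CappedBackoff(long baseMs, long maxMs) {
        if (baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("require 0 < baseMs <= maxMs");
        }
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    @Override
    public long delayMs(int attempt, boolean wrappedAround) {
        // Bound the shift so the doubling cannot overflow for sane base values.
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(maxMs, baseMs << shift);
    }
}
```

With new CappedBackoff(50, 400), for example, the first retries wait 50, 100, 200 and then 400 ms, so a client in the scenario above would probe the surviving nodes several times within the first second instead of sleeping through it.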
*Offer to contribute*
We would be happy to submit a pull request to address this issue if that would
be helpful. Please let us know if a contribution would be welcomed and if you
have any guidance on the preferred approach.
We would appreciate any insights from the maintainers. Thanks!