Bai Yu created CURATOR-627:
------------------------------
Summary: Curator remains in SUSPENDED state ignoring session
timeout or network recovery
Key: CURATOR-627
URL: https://issues.apache.org/jira/browse/CURATOR-627
Project: Apache Curator
Issue Type: Bug
Components: Framework
Affects Versions: 4.2.0
Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64
Reporter: Bai Yu
Attachments: log analysis.md, log4j.log, pom.xml
h1. Description
Our program encountered a problem where curator got stuck in the SUSPENDED
state. It kept injecting "session expire" event, generating logs like this
every 15s (same as session timeout):
{quote}Session timeout has elapsed while SUSPENDED. Injecting a session
expiration{quote}
However, curator never transited to the LOST state after session timeout, nor
transied to the RECONNECTED state upon network recovery.
According to logs containing "zookeeper" or "curator" (see attachment
"log4j.log"), related events are as follows:
(All events happened in date 2021.12.05)
* 23:34:07,122: ZooKeeper not heard from server, thus curator transited to the
SUSPENDED state.
* 23:34:09,128: ZooKeeper opening socket to zk server.
* 23:34:22,223: curator injected a "session expiration" event, but had not
transited to the LOST state at that time; never transited in the future.
* 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse
indicating session expiration; thus, the current ZooKeeper object was closed.
However, another ZooKeeper object was not created as expected.
* 23:34:37,224 and later: curator kept injecting "session expiration" events
every session timeout (15s). The log was filled with lines like this: "Session
timeout has elapsed while SUSPENDED. Injecting a session expiration."
In summary, curator stayed in the SUSPENED state and never transited to the
LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never
recreated according to logs and the "jstack" command. For a more detailed
analysis, please refer to attachment "log analysis.md".
The problem above was encountered only once in our testing environment within
months, and has never occurred in our production enviromnent. We failed to find
out how to reproduce, but suspect there is a racing condition when these events
happen simultaneously: while curator is injecting a "session expire" event, the
underlying "ClientCnxnSocketNIO" has just reconnected to a zk server and got a
"session expire" response.
h1. Environment
OS: Linux version 4.17.3-1.el7.elrepo.x86_64
Project type: maven
Curator version: 4.2.0
Session timeout: 15s, which is equal to that negotiated with the server.
Retry policy: no retry, where RetryPolicy#allowRetry() always returns false.
That's because we handle retry at application level. Write operations should
NOT be retried immediately after reconnected, until some extra validations pass.
Our program creates exactly one "CuratorFramework" instance connecting to the
only zk server in the testing environment. The server's connection string is
"6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in
logs are also such processed.
ZkClient is imported for forward compatiblility with some legacy codes (still
dependent on ZkClient), but it is bridged to curator instead of connecting to
zk server itself. For more details about dependencies, please refer to file
"pom.xml" in attachments.
h1. Attachments
log4j.log: log of our program, with ip addresses and zk paths consealed.
log analysis.md: a more detailed analysis about the logs.
pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)