Bai Yu created CURATOR-627:
------------------------------

             Summary: Curator remains in SUSPENDED state ignoring session 
timeout or network recovery
                 Key: CURATOR-627
                 URL: https://issues.apache.org/jira/browse/CURATOR-627
             Project: Apache Curator
          Issue Type: Bug
          Components: Framework
    Affects Versions: 4.2.0
         Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64
            Reporter: Bai Yu
         Attachments: log analysis.md, log4j.log, pom.xml

h1. Description

Our program encountered a problem where curator got stuck in the SUSPENDED 
state. It kept injecting "session expire" event, generating logs like this 
every 15s (same as session timeout):
{quote}Session timeout has elapsed while SUSPENDED. Injecting a session 
expiration{quote}
However, curator never transited to the LOST state after session timeout, nor 
transied to the RECONNECTED state upon network recovery. 

 

According to logs containing "zookeeper" or "curator" (see attachment 
"log4j.log"), related events are as follows:

(All events happened in date 2021.12.05)
 * 23:34:07,122: ZooKeeper not heard from server, thus curator transited to the 
SUSPENDED state.

 * 23:34:09,128: ZooKeeper opening socket to zk server.

 * 23:34:22,223: curator injected a "session expiration" event, but had not 
transited to the LOST state at that time; never transited in the future.

 * 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse 
indicating session expiration; thus, the current ZooKeeper object was closed. 
However, another ZooKeeper object was not created as expected.

 * 23:34:37,224 and later: curator kept injecting "session expiration" events 
every session timeout (15s). The log was filled with lines like this: "Session 
timeout has elapsed while SUSPENDED. Injecting a session expiration."

In summary, curator stayed in the SUSPENED state and never transited to the 
LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never 
recreated according to logs and the "jstack" command. For a more detailed 
analysis, please refer to attachment "log analysis.md".

 

The problem above was encountered only once in our testing environment within 
months, and has never occurred in our production enviromnent. We failed to find 
out how to reproduce, but suspect there is a racing condition when these events 
happen simultaneously: while curator is injecting a "session expire" event, the 
underlying "ClientCnxnSocketNIO" has just reconnected to a zk server and got a 
"session expire" response.
h1. Environment

OS: Linux version 4.17.3-1.el7.elrepo.x86_64

 

Project type: maven

Curator version: 4.2.0

 

Session timeout: 15s, which is equal to that negotiated with the server.

Retry policy: no retry, where RetryPolicy#allowRetry() always returns false. 
That's because we handle retry at application level. Write operations should 
NOT be retried immediately after reconnected, until some extra validations pass.

 

Our program creates exactly one "CuratorFramework" instance connecting to the 
only zk server in the testing environment. The server's connection string is 
"6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in 
logs are also such processed. 

 

ZkClient is imported for forward compatiblility with some legacy codes (still 
dependent on ZkClient), but it is bridged to curator instead of connecting to 
zk server itself. For more details about dependencies, please refer to file 
"pom.xml" in attachments.
h1. Attachments

log4j.log: log of our program, with ip addresses and zk paths consealed.

log analysis.md: a more detailed analysis about the logs.

pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to