[
https://issues.apache.org/jira/browse/CURATOR-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485359#comment-17485359
]
Nathan Gough commented on CURATOR-627:
--------------------------------------
Some of our users have also seen this in Apache NiFi which uses Curator 4.2.0
(we're scheduled to upgrade to 5.2.0 in our next release):
https://issues.apache.org/jira/browse/NIFI-9559
> Curator remains in SUSPENDED state ignoring session timeout or network
> recovery
> -------------------------------------------------------------------------------
>
> Key: CURATOR-627
> URL: https://issues.apache.org/jira/browse/CURATOR-627
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 4.2.0
> Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64
> Reporter: Bai Yu
> Priority: Major
> Attachments: log analysis.md, log4j.log, pom.xml
>
>
> h1. Description
> Our program encountered a problem where curator got stuck in the SUSPENDED
> state. It kept injecting "session expire" event, generating logs like this
> every 15s (same as session timeout):
> {quote}Session timeout has elapsed while SUSPENDED. Injecting a session
> expiration{quote}
> However, curator never transited to the LOST state after session timeout, nor
> transied to the RECONNECTED state upon network recovery.
>
> According to logs containing "zookeeper" or "curator" (see attachment
> "log4j.log"), related events are as follows:
> (All events happened in date 2021.12.05)
> * 23:34:07,122: ZooKeeper not heard from server, thus curator transited to
> the SUSPENDED state.
> * 23:34:09,128: ZooKeeper opening socket to zk server.
> * 23:34:22,223: curator injected a "session expiration" event, but had not
> transited to the LOST state at that time; never transited in the future.
> * 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse
> indicating session expiration; thus, the current ZooKeeper object was closed.
> However, another ZooKeeper object was not created as expected.
> * 23:34:37,224 and later: curator kept injecting "session expiration" events
> every session timeout (15s). The log was filled with lines like this:
> "Session timeout has elapsed while SUSPENDED. Injecting a session expiration."
> In summary, curator stayed in the SUSPENED state and never transited to the
> LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never
> recreated according to logs and the "jstack" command. For a more detailed
> analysis, please refer to attachment "log analysis.md".
>
> The problem above was encountered only once in our testing environment within
> months, and has never occurred in our production enviromnent. We failed to
> find out how to reproduce, but suspect there is a racing condition when these
> events happen simultaneously: while curator is injecting a "session expire"
> event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk
> server and got a "session expire" response.
> h1. Environment
> OS: Linux version 4.17.3-1.el7.elrepo.x86_64
>
> Project type: maven
> Curator version: 4.2.0
>
> Session timeout: 15s, which is equal to that negotiated with the server.
> Retry policy: no retry, where RetryPolicy#allowRetry() always returns false.
> That's because we handle retry at application level. Write operations should
> NOT be retried immediately after reconnected, until some extra validations
> pass.
>
> Our program creates exactly one "CuratorFramework" instance connecting to the
> only zk server in the testing environment. The server's connection string is
> "6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in
> logs are also such processed.
>
> ZkClient is imported for forward compatiblility with some legacy codes (still
> dependent on ZkClient), but it is bridged to curator instead of connecting to
> zk server itself. For more details about dependencies, please refer to file
> "pom.xml" in attachments.
> h1. Attachments
> log4j.log: log of our program, with ip addresses and zk paths consealed.
> log analysis.md: a more detailed analysis about the logs.
> pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)