[
https://issues.apache.org/jira/browse/CURATOR-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489524#comment-17489524
]
Matthias Sohn commented on CURATOR-627:
---------------------------------------
It seems we are hit by this bug as well.
We use a curator cache to keep track of the cluster state in zookeeper of an
ensemble of Gerrit servers.
We tested a network partition where one of the application nodes can't connect
to zookeeper anymore.
When the network partition ended curator seems not to reconnect and hangs in
state "CONNECTING" with isConnected is false.
Is there an ETA for release 5.2.1 ?
> Curator remains in SUSPENDED state ignoring session timeout or network
> recovery
> -------------------------------------------------------------------------------
>
> Key: CURATOR-627
> URL: https://issues.apache.org/jira/browse/CURATOR-627
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 4.2.0
> Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64
> Reporter: Bai Yu
> Priority: Major
> Attachments: log analysis.md, log4j.log, pom.xml
>
>
> h1. Description
> Our program encountered a problem where curator got stuck in the SUSPENDED
> state. It kept injecting "session expire" event, generating logs like this
> every 15s (same as session timeout):
> {quote}Session timeout has elapsed while SUSPENDED. Injecting a session
> expiration{quote}
> However, curator never transited to the LOST state after session timeout, nor
> transied to the RECONNECTED state upon network recovery.
>
> According to logs containing "zookeeper" or "curator" (see attachment
> "log4j.log"), related events are as follows:
> (All events happened in date 2021.12.05)
> * 23:34:07,122: ZooKeeper not heard from server, thus curator transited to
> the SUSPENDED state.
> * 23:34:09,128: ZooKeeper opening socket to zk server.
> * 23:34:22,223: curator injected a "session expiration" event, but had not
> transited to the LOST state at that time; never transited in the future.
> * 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse
> indicating session expiration; thus, the current ZooKeeper object was closed.
> However, another ZooKeeper object was not created as expected.
> * 23:34:37,224 and later: curator kept injecting "session expiration" events
> every session timeout (15s). The log was filled with lines like this:
> "Session timeout has elapsed while SUSPENDED. Injecting a session expiration."
> In summary, curator stayed in the SUSPENED state and never transited to the
> LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never
> recreated according to logs and the "jstack" command. For a more detailed
> analysis, please refer to attachment "log analysis.md".
>
> The problem above was encountered only once in our testing environment within
> months, and has never occurred in our production enviromnent. We failed to
> find out how to reproduce, but suspect there is a racing condition when these
> events happen simultaneously: while curator is injecting a "session expire"
> event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk
> server and got a "session expire" response.
> h1. Environment
> OS: Linux version 4.17.3-1.el7.elrepo.x86_64
>
> Project type: maven
> Curator version: 4.2.0
>
> Session timeout: 15s, which is equal to that negotiated with the server.
> Retry policy: no retry, where RetryPolicy#allowRetry() always returns false.
> That's because we handle retry at application level. Write operations should
> NOT be retried immediately after reconnected, until some extra validations
> pass.
>
> Our program creates exactly one "CuratorFramework" instance connecting to the
> only zk server in the testing environment. The server's connection string is
> "6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in
> logs are also such processed.
>
> ZkClient is imported for forward compatiblility with some legacy codes (still
> dependent on ZkClient), but it is bridged to curator instead of connecting to
> zk server itself. For more details about dependencies, please refer to file
> "pom.xml" in attachments.
> h1. Attachments
> log4j.log: log of our program, with ip addresses and zk paths consealed.
> log analysis.md: a more detailed analysis about the logs.
> pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)