[
https://issues.apache.org/jira/browse/CURATOR-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhanglu153 updated CURATOR-722:
-------------------------------
Attachment: testCuratorClient.java
> Zookeeper connection leak after session expiration
> --------------------------------------------------
>
> Key: CURATOR-722
> URL: https://issues.apache.org/jira/browse/CURATOR-722
> Project: Apache Curator
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.7.1, 2.12.0, 2.13.0
> Reporter: zhanglu153
> Priority: Major
> Attachments: ConnectionState.patch, testCuratorClient.java
>
>
> *User testing code description:*
> The test code is in [^testCuratorClient.java].
> When creating the Curator client, the test code adds a CuratorListener to
> listen for the AuthFailed event generated when the client's SASL
> authentication fails.
> When the listener detects the AuthFailed event, it closes the Curator
> client, deletes the created node, and enters a while loop.
> In the while loop, it rebuilds a Curator client, adds the CuratorListener,
> starts the client, and creates a znode with SASL permission under the /test
> node, which also has SASL permission. After successful execution, the
> listener code exits.
> While Kerberos remains unavailable, this loop keeps running and blocks the
> handling of further AuthFailed events.
> A lock is added to this listener so that an AuthFailed event raised after
> the newly built Curator client starts cannot enter the listener again,
> which avoids unnecessary while loops.
> *Scenario of Connection Leakage Issue:*
> * The ZooKeeper client successfully connects to the ZooKeeper server with
> session ID 0x0 and creates a znode with SASL permission.
> * Trigger a fault that temporarily disconnects the session from the server.
> * The session state in Curator changes to suspended, and the client
> prepares to reconnect to the server. Set a debugger breakpoint in the
> org.apache.zookeeper.ClientCnxn.SendThread#startConnect method, stop the
> Kerberos service while execution is paused there, and wait for the server
> to determine that the session has expired.
> * Resume execution after the Kerberos service has stopped and the
> session has expired. The client then connects to the ZooKeeper server
> without SASL authentication and sends an AuthFailed event.
> * The listener set by the user receives the AuthFailed event and
> starts running the logic in the listener.
> * Before the listener closes the Curator client, the session is found to
> have expired, and the client sends the Expired event and eventOfDeath.
> * At the same time, in the Curator framework, the
> org.apache.curator.ConnectionState#checkTimeouts method detects a connection
> timeout and calls the reset method to close the old session 0x0. Since
> session 0x0 has expired and the connection status has been set to CLOSED,
> this.cnxn.getState().isAlive() returns false when close() is called to
> release resources. Session 0x0 is therefore treated as already closed, and a
> new ZooKeeper object is created to establish a new session 0x1.
> * Restore the kerberos service.
> * The listener set by the user closes session 0x1 of the Curator
> client, builds a new Curator client, starts a new session 0x2, and
> successfully creates a znode with SASL permission.
> * The SendThread of session 0x0 was closed after the session expired, but
> its EventThread has not finished because the eventOfDeath object
> has not been processed yet. The old org.apache.curator.ConnectionState object
> in Curator is still held by the pending Expired event. When Curator
> handles that Expired event, it calls the reset method again, which makes
> the Curator client on which closeAndClear has already been called
> establish a new session 0x3. At this point, both the leaked session 0x3 and
> session 0x2, which the user rebuilt deliberately, are connected to the
> server simultaneously.
> In this scenario, using Curator leaks a ZooKeeper connection.
> After the user has called the close method to close and clean up resources,
> Curator should not call the reset method again on receiving an Expired
> event, because doing so starts a leaked connection.
> A user's call to the close method should take priority over the
> framework's handling of Expired events. I also found that Curator 4.x has
> a similar issue, CURATOR-437 ("zookeeper connection leak when session
> expires").
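
The rule requested above (close takes priority over a later reset) can be
sketched as follows. This is a minimal, self-contained illustration and not
Curator's actual code: GuardedConnection, openSession, sessionsOpened, and
isSessionOpen are hypothetical names, and a session is simulated with a
counter instead of a real ZooKeeper handle. The only assumption is that
close() and reset() can race, and that a reset arriving after close() must
be ignored.

```java
// Hypothetical stand-in for a connection holder. This is NOT Curator's
// ConnectionState, and no real ZooKeeper session is created.
final class GuardedConnection {
    private boolean closed;       // set once by close(), never cleared
    private boolean sessionOpen;  // whether a simulated session is live
    private int sessionsOpened;   // counts every simulated session created

    // Opens the first simulated session; does nothing once closed.
    synchronized void start() {
        if (!closed && !sessionOpen) {
            openSession();
        }
    }

    // User-initiated close: drops the session and blocks any later reset.
    synchronized void close() {
        closed = true;
        sessionOpen = false;
    }

    // Called when an Expired event is handled.
    // Returns true only if a replacement session was actually opened.
    synchronized boolean reset() {
        if (closed) {
            return false; // a late Expired event must not reopen anything
        }
        sessionOpen = false;
        openSession();
        return true;
    }

    synchronized int sessionsOpened() {
        return sessionsOpened;
    }

    synchronized boolean isSessionOpen() {
        return sessionOpen;
    }

    // Simulates creating a new session; callers hold the monitor.
    private void openSession() {
        sessionsOpened++;
        sessionOpen = true;
    }
}
```

With this guard, a reset() triggered by an Expired event that is delivered
after close() returns false and opens nothing, which corresponds to the
leaked session 0x3 never being created.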
--
This message was sent by Atlassian Jira
(v8.20.10#820010)