zhanglu153 created CURATOR-722:
----------------------------------
Summary: Zookeeper connection leak after session expiration
Key: CURATOR-722
URL: https://issues.apache.org/jira/browse/CURATOR-722
Project: Apache Curator
Issue Type: Bug
Components: Client
Affects Versions: 2.13.0, 2.7.1
Reporter: zhanglu153
Attachments: testCuratorClient.java
*User testing code description:*
The test code is in testCuratorClient.java.
When creating the Curator client,the user added the CuratorListener in the test
code to listen for the AuthFailed event generated when the client SASL
authentication fails.
When the listener detects the AuthFailed event, it will close the Curator
client, delete the created node, and enter a while loop.
In the while loop, rebuild a Curator client, add the CuratorListener, start the
Curator client, and create a znode with sasl permission under the /test node
with sasl permission. After successful execution, the listener code exits.
When kerberos remains unavailable, this loop will continue to block the
handling of AuthFailed events.
Add a lock to this listener to prevent the AuthFailed event that occurs after
the newly established Curator client starts from entering this listener again,
avoiding unnecessary while loops.
*Scenario of Connection Leakage Issue:*
* The Zookeeper client successfully connected to the zookeeper server with
session ID 0x0 and created a znode with SASL permission.
* Create an exception, temporarily disconnect the session from the server.
* The session state in the Curator changes to suspended. The client is
preparing to reconnect to the server. At
org.apache.zookeeper.ClientCnxn.SendThread#startConnect method debugs
breakpoints and stops the kerberos service, waiting for the server to determine
that the session has expired.
* Continue executing code after the kerberos service has stopped and the
session has expired. The client will continue connection to Zookeeper server
without SASL authentication and send an AuthFailed event.
* The listener set by the user will listen for the AuthFailed event and start
processing the code logic in the listener.
* Before closing the Curator client in the listener, the session is found to
have expired, and the client sends the Expired event and eventOfDeath.
* At the same time, in the Curator framework,
org.apache.curator.ConnectionState#checkTimeouts method detected a connection
timeout and called the reset method to close the old session 0x0. Since session
0x0 has expired and the connection status has been set to CLOSED, when calling
close() to release resources, this.cnxn.getState().isAlive() will return fasle.
It was found that the session 0x0 had been closed, and a new Zookeeper object
was created to establish a new session 0x1.
* Restore the kerberos service.
* The listener set by the user will close the session 0x1 of the Curator
client, rebuild a new Curator client, start a new session 0x2, and successfully
create a znode with SASL permission.
* The SendThread of session 0x0 was closed after the session expired, and
EventThread has not completed execution yet because the eventOfDeath object has
not been processed yet. The old org.apache.curator.ConnectionState object in
the Curator has been held by Expired event. When the Curator receives the
Expired event, it will call the reset method again, causing the Curator object
that has already called closeAndClear to establish a new session 0x3.
At this point, both the leaked session 0x3 and the session 0x2 which the user
needs to rebuild are connected to the server simultaneously.
There is a zookeeper connection leak when using a Curator in this scenario. The
Curator framework should not allow the framework to call the reset method again
to restart a leaked connection after receiving an Expired event, after the user
has already called the close method to close and clean up resources. The
priority of users calling the close method should be higher than the
framework's handling of Expired events. At the same time, I found that there is
a similar issue with curator4.x in [CURATOR-437] zookeeper connection leak when
session expires. - ASF JIRA (apache.org).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)