[ https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072222#comment-17072222 ]
Jordan Zimmerman edited comment on CURATOR-525 at 3/31/20, 10:49 PM: --------------------------------------------------------------------- What I'm thinking is that we have two options: * Rewrite connection handling in Curator (likely to introduce new problems, very risky given the age/brittleness of the code) * Add a quick fix. Here's the quick fix: ConnectionStateManager must make sure the CuratorZookeeperClient is "reset" when LOST is set. This would ensure that there is never a zombie Curator instance. More: if ConnectionStateManager.checkSessionExpiration() notices that the current state is LOST but {{client.getZookeeperClient().isConnected()}} returns true - then it can take action by resetting the connection. This is a hack of course, but would be a quick fix. Thoughts? was (Author: randgalt): What I'm thinking is that we have two options: * Rewrite connection handling in Curator (likely to introduce new problems, very risky given the age/brittleness of the code) * Add a quick fix. Here's the quick fix: ConnectionStateManager must make sure the CuratorZookeeperClient is "reset" when LOST is set. This would ensure that there is never a zombie Curator instance. > There is a race condition in Curator which might lead to fake SUSPENDED event > and ruin CuratorFrameworkImpl inner state > ------------------------------------------------------------------------------------------------------------------------ > > Key: CURATOR-525 > URL: https://issues.apache.org/jira/browse/CURATOR-525 > Project: Apache Curator > Issue Type: Bug > Components: Framework > Affects Versions: 4.2.0 > Reporter: Mikhail Valiev > Priority: Critical > Attachments: CuratorFrameworkTest.java, > background-thread-infinite-loop.png, curator-race-condition.png, > event-watcher-thread.png > > > This was originally found in the 2.11.1 version of Curator, but I tested the > latest release as well, and the issue is still there. > The issue is tied to guaranteed deletes and how it loops infinitely, if > called when there is no connection: > client.delete().guaranteed().forPath(ourPath); > [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html] > This schedules a background operation which attempts to remove the node in > infinite loop. Each time a background operation fails due to connection loss > it performs a check (validateConnection() function) to see if the main thread > is already aware of connection loss, and if it's not - raises the connection > loss event. The problem is that this peace of code is also executed by the > event watcher thread when connection events are happening - this leads to > race condition. So when connection is restored it's easily possible for the > main thread to raise RECONNECTED event and after that for background thread > to raise SUSPENDED event. > We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator > inner Connection state and leads to curator behaving unpredictably > Attached some illustrations and Unit test to reproduce the issue. (Put debug > point in validateConnection() ) > *Possible solution*: in CuratorFrameworkImpl class adjust the processEvent() > function and add the following: > if(event.getType() == CuratorEventType.SYNC) { > connectionStateManager.addStateChange(ConnectionState.RECONNECTED); > } > If this is a same state as before - it will be ignored, if background > operation succeeded, but we are in SUSPENDED state - this would repair the > Curator state and raise RECONNECTED event. > -- This message was sent by Atlassian Jira (v8.3.4#803005)