[
https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072222#comment-17072222
]
Jordan Zimmerman edited comment on CURATOR-525 at 3/31/20, 11:42 PM:
---------------------------------------------------------------------
What I'm thinking is that we have two options:
* Rewrite connection handling in Curator (likely to introduce new problems,
very risky given the age/brittleness of the code)
* Add a quick fix.
Here's the quick fix: ConnectionStateManager must make sure the
CuratorZookeeperClient is "reset" when LOST is set. This would ensure that
there is never a zombie Curator instance.
The fix would look like this:
https://gist.github.com/Randgalt/4e84f3f46fd19267f02f7fe57a06e51a
An important question is how to test this. We haven't been able to come up with
a test case that shows the problem.
Note: I just tried this change with [~jschlather] contrived test and it does
fix the problem. This is a hack of course, but would be a quick fix. Thoughts?
was (Author: randgalt):
What I'm thinking is that we have two options:
* Rewrite connection handling in Curator (likely to introduce new problems,
very risky given the age/brittleness of the code)
* Add a quick fix.
Here's the quick fix: ConnectionStateManager must make sure the
CuratorZookeeperClient is "reset" when LOST is set. This would ensure that
there is never a zombie Curator instance.
The fix would look like this:
https://gist.github.com/Randgalt/4e84f3f46fd19267f02f7fe57a06e51a
An important question is how to test this. We haven't been able to come up with
a test case that shows the problem.
More: if ConnectionStateManager.checkSessionExpiration() notices that the
current state is LOST but {{client.getZookeeperClient().isConnected()}} returns
true - then it can take action by resetting the connection. This is a hack of
course, but would be a quick fix. Thoughts?
Note: I just tried this change with [~jschlather] contrived test and it does
fix the problem.
> There is a race condition in Curator which might lead to fake SUSPENDED event
> and ruin CuratorFrameworkImpl inner state
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: CURATOR-525
> URL: https://issues.apache.org/jira/browse/CURATOR-525
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 4.2.0
> Reporter: Mikhail Valiev
> Priority: Critical
> Attachments: CuratorFrameworkTest.java,
> background-thread-infinite-loop.png, curator-race-condition.png,
> event-watcher-thread.png
>
>
> This was originally found in the 2.11.1 version of Curator, but I tested the
> latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes and how it loops infinitely, if
> called when there is no connection:
> client.delete().guaranteed().forPath(ourPath);
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
> This schedules a background operation which attempts to remove the node in
> infinite loop. Each time a background operation fails due to connection loss
> it performs a check (validateConnection() function) to see if the main thread
> is already aware of connection loss, and if it's not - raises the connection
> loss event. The problem is that this peace of code is also executed by the
> event watcher thread when connection events are happening - this leads to
> race condition. So when connection is restored it's easily possible for the
> main thread to raise RECONNECTED event and after that for background thread
> to raise SUSPENDED event.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator
> inner Connection state and leads to curator behaving unpredictably
> Attached some illustrations and Unit test to reproduce the issue. (Put debug
> point in validateConnection() )
> *Possible solution*: in CuratorFrameworkImpl class adjust the processEvent()
> function and add the following:
> if(event.getType() == CuratorEventType.SYNC) {
> connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is a same state as before - it will be ignored, if background
> operation succeeded, but we are in SUSPENDED state - this would repair the
> Curator state and raise RECONNECTED event.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)