[jira] [Comment Edited] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

Jordan Zimmerman (Jira) Tue, 31 Mar 2020 16:43:38 -0700


    [ 
https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072222#comment-17072222
 ]


Jordan Zimmerman edited comment on CURATOR-525 at 3/31/20, 11:42 PM:
---------------------------------------------------------------------

What I'm thinking is that we have two options:

* Rewrite connection handling in Curator (likely to introduce new problems, 
very risky given the age/brittleness of the code)
* Add a quick fix. 

Here's the quick fix: ConnectionStateManager must make sure the 
CuratorZookeeperClient is "reset" when LOST is set. This would ensure that 
there is never a zombie Curator instance. 

The fix would look like this: 
https://gist.github.com/Randgalt/4e84f3f46fd19267f02f7fe57a06e51a

An important question is how to test this. We haven't been able to come up with 
a test case that shows the problem.

Note: I just tried this change with [~jschlather] contrived test and it does 
fix the problem. This is a hack of course, but would be a quick fix. Thoughts?


was (Author: randgalt):
What I'm thinking is that we have two options:

* Rewrite connection handling in Curator (likely to introduce new problems, 
very risky given the age/brittleness of the code)
* Add a quick fix. 

Here's the quick fix: ConnectionStateManager must make sure the 
CuratorZookeeperClient is "reset" when LOST is set. This would ensure that 
there is never a zombie Curator instance. 

The fix would look like this: 
https://gist.github.com/Randgalt/4e84f3f46fd19267f02f7fe57a06e51a

An important question is how to test this. We haven't been able to come up with 
a test case that shows the problem.

More: if ConnectionStateManager.checkSessionExpiration() notices that the 
current state is LOST but {{client.getZookeeperClient().isConnected()}} returns 
true - then it can take action by resetting the connection. This is a hack of 
course, but would be a quick fix. Thoughts?

Note: I just tried this change with [~jschlather] contrived test and it does 
fix the problem.

> There is a race condition in Curator which might lead to fake SUSPENDED event 
> and ruin CuratorFrameworkImpl inner state 
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-525
>                 URL: https://issues.apache.org/jira/browse/CURATOR-525
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>            Reporter: Mikhail Valiev
>            Priority: Critical
>         Attachments: CuratorFrameworkTest.java, 
> background-thread-infinite-loop.png, curator-race-condition.png, 
> event-watcher-thread.png
>
>
> This was originally found in the 2.11.1 version of Curator, but I tested the 
> latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes and how it loops infinitely, if 
> called when there is no connection:
> client.delete().guaranteed().forPath(ourPath); 
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
> This schedules a background operation which attempts to remove the node in 
> infinite loop. Each time a background operation fails due to connection loss 
> it performs a check (validateConnection() function) to see if the main thread 
> is already aware of connection loss, and if it's not - raises the connection 
> loss event. The problem is that this peace of code is also executed by the 
> event watcher thread when connection events are happening - this leads to 
> race condition. So when connection is restored it's easily possible for the 
> main thread to raise RECONNECTED event and after that for background thread 
> to raise SUSPENDED event.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator 
> inner Connection state and leads to curator behaving unpredictably
> Attached some illustrations and Unit test to reproduce the issue. (Put debug 
> point in validateConnection() )
> *Possible solution*: in CuratorFrameworkImpl class adjust the processEvent() 
> function and add the following:
> if(event.getType() == CuratorEventType.SYNC) {
> connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is a same state as before - it will be ignored, if background 
> operation succeeded, but we are in SUSPENDED state - this would repair the 
> Curator state and raise RECONNECTED event.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

Reply via email to