[jira] [Comment Edited] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

Chao Chu (Jira) Tue, 31 Mar 2020 13:21:39 -0700


    [ 
https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072125#comment-17072125
 ]


Chao Chu edited comment on CURATOR-525 at 3/31/20, 8:20 PM:
------------------------------------------------------------

we have hit the same issue as [~jschlather] mentioned above, where we got the 
fake LOST event during previous SendThread's cleanup (in a call chain like 
cleanup -> conLossPacket -> finishPacket -> queuePacket -> processEvent), which 
then calls back into background retry and the operation's session expired got 
translated to a fake LOST event in validateConnection at the end of 
CuratorFrameworkImpl.checkBackgroundRetry.

 

The extra LOST event got passed to ConnectionStateManager after the new 
SendThread have established a new session, which led to a broken 
CuratorFrameworkImpl instance stayed in LOST state thereafter.

 

it's not clear to me why the checkBackgroundRetry need call the 
validateConnection at the end?

 

we could probably fix this by change the `validateConnection`, to be more 
defensive, it should distinguish if it's called from state change directly or, 
got the state change translated from a KeeperException's error code, i.e., via 
the codeToState(...) call, for the latter, before fire the LOST, we should 
check the underlying curator client's connection state, something like below:

https://gist.github.com/chuchao333/920aa6c90697ea0f06ca24939aed67f0

what do you think?


was (Author: chuchao333):
we have hit the same issue as [~jschlather] mentioned above, where we got the 
fake LOST event during previous SendThread's cleanup (in a call chain like 
cleanup -> conLossPacket -> finishPacket -> queuePacket -> processEvent), which 
then calls back into background retry and the operation's session expired got 
translated to a fake LOST event in validateConnection at the end of 
CuratorFrameworkImpl.checkBackgroundRetry.

 

The extra LOST event got passed to ConnectionStateManager after the new 
SendThread have established a new session, which led to a broken 
CuratorFrameworkImpl instance stayed in LOST state thereafter.

 

it's not clear to me why the checkBackgroundRetry need call the 
validateConnection at the end?

 

we could probably fix this by change the `validateConnection`, to be more 
defensive, it should distinguish if it's called from state change directly or, 
got the state change translated from a KeeperException's error code, i.e., via 
the codeToState(...) call, for the latter, before fire the LOST, we should 
check the underlying curator client's connection state, something like below:

 

void validateConnection(KeeperState sate, boolean fromCodeToState) {

  ...

  else if (state == Watcher.Event.KeeperState.Expired)
  {
    if (fromCodeToState && client.isConnected())
       // ignore
    else 
       // fire a LOST as it is now       
      connectionStateManager.addStateChange(ConnectionState.LOST);   }
} 

 

what do you think?

> There is a race condition in Curator which might lead to fake SUSPENDED event 
> and ruin CuratorFrameworkImpl inner state 
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-525
>                 URL: https://issues.apache.org/jira/browse/CURATOR-525
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>            Reporter: Mikhail Valiev
>            Priority: Critical
>         Attachments: CuratorFrameworkTest.java, 
> background-thread-infinite-loop.png, curator-race-condition.png, 
> event-watcher-thread.png
>
>
> This was originally found in the 2.11.1 version of Curator, but I tested the 
> latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes and how it loops infinitely, if 
> called when there is no connection:
> client.delete().guaranteed().forPath(ourPath); 
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
> This schedules a background operation which attempts to remove the node in 
> infinite loop. Each time a background operation fails due to connection loss 
> it performs a check (validateConnection() function) to see if the main thread 
> is already aware of connection loss, and if it's not - raises the connection 
> loss event. The problem is that this peace of code is also executed by the 
> event watcher thread when connection events are happening - this leads to 
> race condition. So when connection is restored it's easily possible for the 
> main thread to raise RECONNECTED event and after that for background thread 
> to raise SUSPENDED event.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator 
> inner Connection state and leads to curator behaving unpredictably
> Attached some illustrations and Unit test to reproduce the issue. (Put debug 
> point in validateConnection() )
> *Possible solution*: in CuratorFrameworkImpl class adjust the processEvent() 
> function and add the following:
> if(event.getType() == CuratorEventType.SYNC) {
> connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is a same state as before - it will be ignored, if background 
> operation succeeded, but we are in SUSPENDED state - this would repair the 
> Curator state and raise RECONNECTED event.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

Reply via email to