[jira] [Commented] (CURATOR-460) Timed tolerance for connection suspended leads to simultaneous leaders

ASF GitHub Bot (JIRA) Wed, 04 Apr 2018 17:11:47 -0700

    [ 
https://issues.apache.org/jira/browse/CURATOR-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426355#comment-16426355
 ]


ASF GitHub Bot commented on CURATOR-460:
----------------------------------------

Github user arrodrigues commented on a diff in the pull request:

    https://github.com/apache/curator/pull/262#discussion_r179319231
  
    --- Diff: 
curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java
 ---
    @@ -222,6 +223,35 @@ public void notLeader()
                 next.add(states.poll(timing.forSessionSleep().milliseconds(), 
TimeUnit.MILLISECONDS));
                 next.add(states.poll(timing.forSessionSleep().milliseconds(), 
TimeUnit.MILLISECONDS));
                 
Assert.assertTrue(next.equals(Arrays.asList(ConnectionState.LOST.name(), 
"false")) || next.equals(Arrays.asList("false", ConnectionState.LOST.name())), 
next.toString());
    +
    +            latch.close();
    +            client.close();
    +
    +            timing.sleepABit();
    +            states.clear();
    +            server = new TestingServer();
    +            client = CuratorFrameworkFactory.builder()
    +                    .connectString(server.getConnectString())
    +                    .connectionTimeoutMs(1000)
    +                    .sessionTimeoutMs(timing.session())
    +                    .retryPolicy(new RetryOneTime(1))
    +                    .connectionStateErrorPolicy(new 
SessionConnectionStateErrorPolicy())
    +                    .connectionHandlingPolicy(new 
StandardConnectionHandlingPolicy(30))
    +                    .build();
    +            
client.getConnectionStateListenable().addListener(stateListener);
    +            client.start();
    +            latch = new LeaderLatch(client, "/test");
    +            latch.addListener(listener);
    +            latch.start();
    +            
Assert.assertEquals(states.poll(timing.forWaiting().milliseconds(), 
TimeUnit.MILLISECONDS), ConnectionState.CONNECTED.name());
    +            
Assert.assertEquals(states.poll(timing.forWaiting().milliseconds(), 
TimeUnit.MILLISECONDS), "true");
    +            server.close();
    +            
Assert.assertEquals(states.poll(timing.forWaiting().milliseconds(), 
TimeUnit.MILLISECONDS), ConnectionState.SUSPENDED.name());
    +            next = Lists.newArrayList();
    +
    +            next.add(states.poll(timing.session() / 3, 
TimeUnit.MILLISECONDS));
    +            next.add(states.poll(timing.forSleepingABit().milliseconds(), 
TimeUnit.MILLISECONDS));
    +            
Assert.assertTrue(next.equals(Arrays.asList(ConnectionState.LOST.name(), 
"false")) || next.equals(Arrays.asList("false", ConnectionState.LOST.name())), 
next.toString());
    --- End diff --
    
    OK. I'm going to adjust that.


> Timed tolerance for connection suspended leads to simultaneous leaders
> ----------------------------------------------------------------------
>
>                 Key: CURATOR-460
>                 URL: https://issues.apache.org/jira/browse/CURATOR-460
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 3.3.0, 4.0.0, 4.0.1
>            Reporter: Antonio Rafael Rodrigues
>            Assignee: Cameron McKenzie
>            Priority: Major
>         Attachments: ConnectionStateManager.patch, 
> Console-leaderSelector1.log, Console-leaderSelector2.log, 
> CuratorLeaderSelectorPOC.java, LogApp1.log, LogApp2.log
>
>
> Starting from Curator 3, after losing the connection to zookeeper, curator 
> emits a SUSPENDED event and then starts an internal timer, if the time of the 
> negotiatied session timeout get hit and curator didn't reconnect to zookeeper 
> it emits a LOST event.
> For example :
>  Given negotiated session timeout = 40
> ||Time (seconds)||Event||
> |0|SUSPENDED (curator has been disconnected from zookeeper)|
> |40|LOST (curator is still disconnected, it must have been lost the sesion as 
> it is disconnected for 40sec)|
> Given this scenario we could ,theoretically, ignore the SUSPENDED event and 
> consider the leadership as lost just if the ConnectionStateListener receives 
> the LOST event.
> But this feature seems to have introduced a bug (from my point of view)
> *Case of failure*
> ||Time (seconds)||Event||
> |0|Something went wrong with the connected zookeeper (in my case, the network 
> inteface of zookeeper's server has gone down). Curator stops hearing heart 
> beats from zookeeper, but doesn't lose the connection (from some reason that 
> I don't know, if the network interface of the server goes down, Curator 
> doesn't lose the connection)|
> |~26.666|SUSPENDED (after 26 seconds without hearing heartbeats, curator 
> emits a SUSPENDED event) 26 is from "readTimeout = sessionTimeout * 2 / 3" 
> from the class ClientCnxn.java from zookeeper client. At this point, curator 
> starts counting 40 sec.|
> |26.666 to 40|During this period, Curator is trying to connect to other 
> zookeeper instances but the other instances, in my example case, are also 
> uncheachable.|
> |40|Session has expired and the other instance has taken leadership ( in my 
> example, the other instance can connect to zookeeper )|
> |66.666|LOST ( after 40sec from SUSPENDED, curator finally sends LOST)|
> As you can see, if we are ignoring the SUSPENDED event, the second 
> application instance acquires the leadership ~26 seconds before the first 
> instance notice that it lost the leadership.
>  I understand it seems to be a very rare case, but either way I think it 
> should be addressed.
> *Reproduce it*
> I have came up with a way to reproduce this easily. I know this is not a 
> recommended production setup, but it reproduces the problem anyway.
> 1) On a virtual machine, with 2 network interfaces (eth0[192.168.0.101], 
> eth1[192.168.0.102]) , I installed one zookeeper instance.
>  2) I setup application1 with the leadership receipe, with 40sec of 
> negotiated timeout, pointing just to 192.168.0.101  .Now it is the leader
>  3) I setup application2 with the leadership receipe, with 40sec of 
> negotiated timeout, pointing just to 192.168.0.102  .Now it it is NOT the 
> leader
>  4) On the server I turn the eth0[192.168.0.101] interface down  [ ifconfig 
> eth0 down ]
>  5) After 26 seconds, application1 says : 
>            :ClientCnxn$SendThread@1111] - Client session timed out, have not 
> heard from server in 26679ms for sessionid
>             ConnectionStateManager@237] - State change: SUSPENDED
>         NOTE: I'm ignoring the SUSPENDED event
> 6) After 40 seconds, application2 takes leadership
>  7) After 66 seconds, application1 says :
>              ConnectionStateManager@237] - State change: LOST
>          
>          NOTE: Just at this point, I consider that the application1 has lost 
> leadership
> Then, for 26 seconds, we had 2 leaders.
> If you confirm it as a bug, I think I could help.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CURATOR-460) Timed tolerance for connection suspended leads to simultaneous leaders

Reply via email to