[ 
https://issues.apache.org/jira/browse/CURATOR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043067#comment-17043067
 ] 

Cam McKenzie commented on CURATOR-561:
--------------------------------------

I'm just trying to reproduce this at the moment. I'm positive that I debugged 
something very similar to this years ago, but never got to the bottom of it. I 
think that the proposed solution should work, and agree that the flag should be 
bound to the EventThread instance.

> Race condition preventing reconnection
> --------------------------------------
>
>                 Key: CURATOR-561
>                 URL: https://issues.apache.org/jira/browse/CURATOR-561
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Grant Digby
>            Assignee: Jordan Zimmerman
>            Priority: Major
>
> Heavy GC caused several rounds of suspended, lost, and reconnected states, as 
> you'd expect. But eventually the client failed to reconnect, even though 
> ZooKeeper was up the entire time. My thread waited in the lock acquire until 
> the retry period was exhausted. I believe I've tracked this down to a race 
> condition between the Curator thread, which calls 
> ZooKeeperTestable#injectSessionExpiration to post the expiry and death events, 
> and the ZooKeeper SendThread, which posts the same pair of events.
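> 
> For context, here is roughly what injectSessionExpiration does; this is a 
> paraphrase of ZooKeeper 3.5.x's ZooKeeperTestable and may differ slightly from 
> the exact source:
> {code:java}
> // paraphrased sketch of ZooKeeperTestable#injectSessionExpiration (not verbatim):
> // it queues the Expired event followed by the event of death on the client's
> // EventThread, i.e. the same pair of events the SendThread posts on expiry
> clientCnxn.eventThread.queueEvent(new WatchedEvent(
>         Watcher.Event.EventType.None,
>         Watcher.Event.KeeperState.Expired, null));
> clientCnxn.eventThread.queueEventOfDeath();
> {code}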
>  
> The EventThread is meant to deliver the expiry event first, so that a 
> replacement can be set up, and only then receive the event of death that shuts 
> it down. The race condition allows these two events to arrive in the wrong 
> order, so the EventThread shuts down before a replacement can be set up.
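> 
> For reference, the shutdown behaviour comes from the EventThread's run loop. A 
> paraphrased sketch (based on ZooKeeper 3.5.x's ClientCnxn.EventThread, not 
> verbatim) shows why: once the event of death is taken and the queue is empty, 
> the thread exits, and anything queued afterwards is never delivered:
> {code:java}
> // paraphrased from ClientCnxn$EventThread.run() (3.5.x); details may vary by version
> while (true) {
>     Object event = waitingEvents.take();
>     if (event == eventOfDeath) {
>         wasKilled = true;            // death event seen
>     } else {
>         processEvent(event);         // delivers the event to watchers; this is
>                                      // how Curator would learn of the expiry
>     }
>     if (wasKilled) {
>         synchronized (waitingEvents) {
>             if (waitingEvents.isEmpty()) {
>                 isRunning = false;   // thread exits; later events are stranded
>                 break;
>             }
>         }
>     }
> }
> {code}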
>  
> The "Curator-ConnectionStateManager" thread calls injectSessionExpiration 
> which calls EventThread#queueEvent. It goes as far as my comment in the next 
> code snippit, sets the status and let's say it pauses here, for whatever 
> reason.
>  
> {code:java}
> private void queueEvent(WatchedEvent event,
>         Set<Watcher> materializedWatchers) {
>     if (event.getType() == EventType.None
>             && sessionState == event.getState()) {
>         return;
>     }
>     sessionState = event.getState();   // the Curator thread sets the state here
>     // it then goes on to send the expiry event, but for now assume it is held
>     // here; the event has not yet been sent
> {code}
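> 
> For completeness, the part of queueEvent that the Curator thread has not yet 
> reached is where the event actually becomes visible to the EventThread (again 
> paraphrased from ZooKeeper 3.5.x, not verbatim):
> {code:java}
> // paraphrased remainder of queueEvent; only the final add() publishes the event
> final Set<Watcher> watchers;
> if (materializedWatchers == null) {
>     watchers = watcher.materialize(event.getState(), event.getType(), event.getPath());
> } else {
>     watchers = new HashSet<>(materializedWatchers);
> }
> waitingEvents.add(new WatcherSetEventPair(watchers, event));
> {code}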
>  
> The SendThread then enters the same method with its own Expired event, but the 
> state has already been set to Expired by the Curator thread, so the method 
> returns without queueing the expiry event. The SendThread then queues the 
> event of death:
> {code:java}
> // from ClientCnxn#onConnected
> eventThread.queueEvent(new WatchedEvent(
>         Watcher.Event.EventType.None,
>         Watcher.Event.KeeperState.Expired, null));
> eventThread.queueEventOfDeath();
> {code}
>  
> The EventThread receives this event of death and shuts down. The Curator 
> thread then continues and queues its original expiry event, but it is now too 
> late: with the EventThread gone and no replacement created, Curator remains in 
> the suspended state and any operations just retry until they give up.
>  
> I find it difficult to explain race conditions, but hopefully that made sense. 
> I can recreate the issue in IntelliJ using Suspend-All breakpoints to cause 
> the session expiry and Suspend-Thread breakpoints to order the threads in the 
> right way:
>  
> Run the following and put a Suspend-All breakpoint at the System.out line:
> {code:java}
> package com.gebatech.curator;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
> import org.apache.curator.retry.BoundedExponentialBackoffRetry;
>
> public class GoCurator {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework cf = CuratorFrameworkFactory.builder()
>                 .connectString("localhost:2181")
>                 .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, 20))
>                 .sessionTimeoutMs(30 * 1000)
>                 .build();
>         cf.start();
>         InterProcessReadWriteLock lock =
>                 new InterProcessReadWriteLock(cf, "/grant/myLock");
>         System.out.println("Set a Suspend-All breakpoint here and leave it"
>                 + " until the session expires, 30 seconds or so");
>         lock.readLock().acquire();
>         System.out.println();
>     }
> }
> {code}
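> 
> As an optional aid while reproducing, a connection state listener (standard 
> Curator API, not part of the original program) makes the stuck state easy to 
> see; once the race fires you would expect SUSPENDED with no RECONNECTED 
> following:
> {code:java}
> // optional addition to the program above: log every connection state transition
> cf.getConnectionStateListenable().addListener(
>         (client, newState) -> System.out.println("Connection state: " + newState));
> {code}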
> Whilst suspended there, put a Suspend-Thread breakpoint in 
> EventThread#queueEvent, on the if statement from the first code snippet.
>  
> When you see the following log line, you can let the program resume:
> {code:java}
> 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1006300f2390000, timeout of 30000ms exceeded
> {code}
> The SendThread will hit your breakpoint first with a Disconnected event; you 
> can let this run.
> Next you'll get the SendThread with the Expired event. Wait here until you get 
> the popup telling you that the Curator thread has hit the same breakpoint. 
> Switch to the Curator thread and step through until it has set the expired 
> state, then leave it held there and switch back to the SendThread. Let the 
> SendThread run, then remove the breakpoints and let everything else run.
>  
> The program will then be stuck and you'll see this stack trace for the lock 
> acquisition:
> {code:java}
> "main@1" prio=5 tid=0x1 nid=NA waiting
>   java.lang.Thread.State: WAITING
>     at java.lang.Object.wait(Object.java:-1)
>     at java.lang.Object.wait(Object.java:502)
>     at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)
>     at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547)
>     at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
>     at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
>     at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
>     at com.gebatech.curator.GoCurator.main(GoCurator.java:26)
> {code}
>  
> I know this issue falls between ZooKeeper and Curator, but I thought I'd 
> report it in case you can think of a solution. Ideally, I think they'd either 
> send a single expiry event with a death flag set or, performance impact 
> allowing, synchronize that method (see the sketch below).
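> 
> To illustrate the second option only, here is a rough sketch of what 
> synchronizing the check and the enqueue might look like. This is hypothetical, 
> not the actual ZooKeeper or Curator fix; the field and class names follow the 
> snippet quoted above:
> {code:java}
> // hypothetical sketch: make the dedup check, the state update and the enqueue one
> // atomic step, so no other thread can observe the new sessionState before the
> // corresponding event has been queued
> private void queueEvent(WatchedEvent event, Set<Watcher> materializedWatchers) {
>     synchronized (waitingEvents) {
>         if (event.getType() == EventType.None
>                 && sessionState == event.getState()) {
>             return;
>         }
>         sessionState = event.getState();
>         final Set<Watcher> watchers = (materializedWatchers == null)
>                 ? watcher.materialize(event.getState(), event.getType(), event.getPath())
>                 : new HashSet<>(materializedWatchers);
>         waitingEvents.add(new WatcherSetEventPair(watchers, event));
>     }
> }
> {code}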



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
