[ 
https://issues.apache.org/jira/browse/CURATOR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043066#comment-17043066
 ] 

Grant Digby commented on CURATOR-561:
-------------------------------------

Sounds reasonable, if the flag is associated with the instance of EventThread 
then we'd know it's not spawned a new one since the last call to inject session 
so can just call reset directly. I'm not sufficiently familiar with Curator to 
know the implications but happy to test it and report back for my use case when 
ready.

Strange you couldn't reproduce it. The key is for the curator inject sesison 
thread to set the state to expired but not proceed further. Then the SendThread 
tries to post the expired state but returns when it finds the state is already 
expired, so it just posts the death event causing the event thread to shut 
down. After that the curator inject sesson thread can carry on. I could capture 
a video of the repro and post it somewhere?!

Thanks for looking into this.

> Race condition preventing reconnection
> --------------------------------------
>
>                 Key: CURATOR-561
>                 URL: https://issues.apache.org/jira/browse/CURATOR-561
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Grant Digby
>            Assignee: Jordan Zimmerman
>            Priority: Major
>
> Heavy GC caused several rounds of suspend, lost, reconnected - as you'd 
> expect. But eventually it failed to reconnect, even though ZK was up the 
> entire time. My thread waited on acquireLock until the retry period was up. I 
> believe I've tracked this down to a race condition between the Curator thread 
> which calls ZookKeeperTestable#injectSessionExpiration to post the expiry and 
> death events and the Zookeeper SendThread which posts the same.
>  
> The EventThread is meant to receive the expiry event first so it can set up 
> its replacement, then it receives the death event making it shut down.
> The race condition allows these two events to be received in the wrong order 
> so the EventThread shuts down before a replacement can be set up.
>  
> The "Curator-ConnectionStateManager" thread calls injectSessionExpiration 
> which calls EventThread#queueEvent. It goes as far as my comment in the next 
> code snippit, sets the status and let's say it pauses here, for whatever 
> reason.
>  
> {code:java}
> private void queueEvent(WatchedEvent event,
>         Set<Watcher> materializedWatchers) {
>     if (event.getType() == EventType.None
>             && sessionState == event.getState()) {
>         return;
>     }
>     sessionState = event.getState();   // The curator thread sets the status 
> here
>   // It then goes onto send the expiry event, but for now it's here..the 
> event has not yet been sent{code}
>  
> The SendThread then goes into the same method with its Expiry message, but 
> the state has already been set to expired by the curator thread so it returns 
> without sending the expiry message. It then sends the event of death:
> {code:java}
> // from ClientCnxn#onConnected
> eventThread.queueEvent(new WatchedEvent(
>         Watcher.Event.EventType.None,
>         Watcher.Event.KeeperState.Expired, null));
> eventThread.queueEventOfDeath(); {code}
>  
> The EventThread receives this event of death and shuts down. The curator 
> thread then continues and sends the original expiry message but it's now too 
> late. With no EventThread created it remains in suspended state with any 
> operations just retrying until they give up.
>  
> I find it difficult to explan race conditions, but hopefully that made sense. 
> I can recreate the issue in Intellij using Suspend-All break points to cause 
> the session expiry and Suspend-Thread break points to order the threads in 
> the correct way:
>  
> Run the following and put a suspend-all break point at the System.out
> {code:java}
> package com.gebatech.curator;
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
> import org.apache.curator.retry.BoundedExponentialBackoffRetry;
> /**
>  *
>  */
> public class GoCurator {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework cf = CuratorFrameworkFactory.builder()
>                 .connectString("localhost:2181")
>                 .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, 
> 20))
>                 .sessionTimeoutMs(30 * 1000)
>                 .build();
>         cf.start();
>         InterProcessReadWriteLock lcok = new InterProcessReadWriteLock(cf, 
> "/grant/myLock");
>         System.out.println("Set a Suspend-ALL break point here and leave it 
> until the session expires, 30 seconds ish");
>         lcok.readLock().acquire();
>         System.out.println();
>     }
> } {code}
> Whlst suspended here, put a Suspend-Thread break poing in 
> EventThread#queueEvent, on the if statement from the first code snippet.
>  
> When you see the following log line, you can let the program resume;
> {code:java}
> 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - 
> Expiring session 0x1006300f2390000, timeout of 30000ms exceeded {code}
> The SendThread will hit your break point first with a Disconnect event, you 
> can let this run.
> Next you'll get the SendThread with the Expiry event, wait here until you get 
> the popup telling you that the Curator thread has hit the same break point. 
> Switch to this and step through until it has set the expiry status, then 
> leave it held there and switch back to the SendThread. Let the sendThread 
> run, then remove break points and let everything else run.
>  
> The program will then be stuck and you'll see this stacktrace for the acquire 
> lock method
> {code:java}
> "main@1" prio=5 tid=0x1 nid=NA waiting"main@1" prio=5 tid=0x1 nid=NA waiting  
> java.lang.Thread.State: WAITING   at java.lang.Object.wait(Object.java:-1)   
> at java.lang.Object.wait(Object.java:502)   at 
> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)   at 
> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)   at 
> org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547)   at 
> org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
>    at 
> org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
>    at 
> org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
>    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
>    at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
>    at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
>    at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
>    at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
>    at 
> org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
>    at 
> org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
>    at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
>    at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
>    at com.gebatech.curator.GoCurator.main(GoCurator.java:26) {code}
>  
> I know this issue falls between Zookeeper and Curator but thought I'd report 
> it in case you can think of a solution. I think ideally they'd send a single 
> expiry event with a death flag set or, performance impact allowing, 
> synchronize that method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to