[ https://issues.apache.org/jira/browse/CURATOR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043067#comment-17043067 ]
Cam McKenzie commented on CURATOR-561: -------------------------------------- I'm just trying to reproduce this at the moment. I'm positive that I debugged something very similar to this years ago, but never got to the bottom of it. I think that the proposed solution should work, and agree that the flag should be bound to the EventThread instance. > Race condition preventing reconnection > -------------------------------------- > > Key: CURATOR-561 > URL: https://issues.apache.org/jira/browse/CURATOR-561 > Project: Apache Curator > Issue Type: Bug > Affects Versions: 4.2.0 > Reporter: Grant Digby > Assignee: Jordan Zimmerman > Priority: Major > > Heavy GC caused several rounds of suspend, lost, reconnected - as you'd > expect. But eventually it failed to reconnect, even though ZK was up the > entire time. My thread waited on acquireLock until the retry period was up. I > believe I've tracked this down to a race condition between the Curator thread > which calls ZookKeeperTestable#injectSessionExpiration to post the expiry and > death events and the Zookeeper SendThread which posts the same. > > The EventThread is meant to receive the expiry event first so it can set up > its replacement, then it receives the death event making it shut down. > The race condition allows these two events to be received in the wrong order > so the EventThread shuts down before a replacement can be set up. > > The "Curator-ConnectionStateManager" thread calls injectSessionExpiration > which calls EventThread#queueEvent. It goes as far as my comment in the next > code snippit, sets the status and let's say it pauses here, for whatever > reason. > > {code:java} > private void queueEvent(WatchedEvent event, > Set<Watcher> materializedWatchers) { > if (event.getType() == EventType.None > && sessionState == event.getState()) { > return; > } > sessionState = event.getState(); // The curator thread sets the status > here > // It then goes onto send the expiry event, but for now it's here..the > event has not yet been sent{code} > > The SendThread then goes into the same method with its Expiry message, but > the state has already been set to expired by the curator thread so it returns > without sending the expiry message. It then sends the event of death: > {code:java} > // from ClientCnxn#onConnected > eventThread.queueEvent(new WatchedEvent( > Watcher.Event.EventType.None, > Watcher.Event.KeeperState.Expired, null)); > eventThread.queueEventOfDeath(); {code} > > The EventThread receives this event of death and shuts down. The curator > thread then continues and sends the original expiry message but it's now too > late. With no EventThread created it remains in suspended state with any > operations just retrying until they give up. > > I find it difficult to explan race conditions, but hopefully that made sense. > I can recreate the issue in Intellij using Suspend-All break points to cause > the session expiry and Suspend-Thread break points to order the threads in > the correct way: > > Run the following and put a suspend-all break point at the System.out > {code:java} > package com.gebatech.curator; > import org.apache.curator.framework.CuratorFramework; > import org.apache.curator.framework.CuratorFrameworkFactory; > import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock; > import org.apache.curator.retry.BoundedExponentialBackoffRetry; > /** > * > */ > public class GoCurator { > public static void main(String[] args) throws Exception { > CuratorFramework cf = CuratorFrameworkFactory.builder() > .connectString("localhost:2181") > .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, > 20)) > .sessionTimeoutMs(30 * 1000) > .build(); > cf.start(); > InterProcessReadWriteLock lcok = new InterProcessReadWriteLock(cf, > "/grant/myLock"); > System.out.println("Set a Suspend-ALL break point here and leave it > until the session expires, 30 seconds ish"); > lcok.readLock().acquire(); > System.out.println(); > } > } {code} > Whlst suspended here, put a Suspend-Thread break poing in > EventThread#queueEvent, on the if statement from the first code snippet. > > When you see the following log line, you can let the program resume; > {code:java} > 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - > Expiring session 0x1006300f2390000, timeout of 30000ms exceeded {code} > The SendThread will hit your break point first with a Disconnect event, you > can let this run. > Next you'll get the SendThread with the Expiry event, wait here until you get > the popup telling you that the Curator thread has hit the same break point. > Switch to this and step through until it has set the expiry status, then > leave it held there and switch back to the SendThread. Let the sendThread > run, then remove break points and let everything else run. > > The program will then be stuck and you'll see this stacktrace for the acquire > lock method > {code:java} > "main@1" prio=5 tid=0x1 nid=NA waiting"main@1" prio=5 tid=0x1 nid=NA waiting > java.lang.Thread.State: WAITING at java.lang.Object.wait(Object.java:-1) > at java.lang.Object.wait(Object.java:502) at > org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499) at > org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487) at > org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547) at > org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180) > at > org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156) > at > org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) at > org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) > at > org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51) > at > org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) > at > org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225) > at > org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237) > at > org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89) > at com.gebatech.curator.GoCurator.main(GoCurator.java:26) {code} > > I know this issue falls between Zookeeper and Curator but thought I'd report > it in case you can think of a solution. I think ideally they'd send a single > expiry event with a death flag set or, performance impact allowing, > synchronize that method. -- This message was sent by Atlassian Jira (v8.3.4#803005)