[
https://issues.apache.org/jira/browse/CURATOR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438250#comment-17438250
]
Scott Kirkpatrick commented on CURATOR-561:
-------------------------------------------
[~randgalt] [~cammckenzie] I tried to make a patch for this:
https://github.com/apache/curator/pull/399 . I've been seeing this issue in
production every month or so and it's been a real headache. I was able to make
a test that reproduced the issue, albeit it relies on some knowledge of the
internal code of ZooKeeper.
> Race condition preventing reconnection
> --------------------------------------
>
> Key: CURATOR-561
> URL: https://issues.apache.org/jira/browse/CURATOR-561
> Project: Apache Curator
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Grant Digby
> Assignee: Jordan Zimmerman
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Heavy GC caused several rounds of suspend, lost, reconnected - as you'd
> expect. But eventually it failed to reconnect, even though ZK was up the
> entire time. My thread waited on acquireLock until the retry period was up. I
> believe I've tracked this down to a race condition between the Curator thread
> which calls ZookKeeperTestable#injectSessionExpiration to post the expiry and
> death events and the Zookeeper SendThread which posts the same.
>
> The EventThread is meant to receive the expiry event first so it can set up
> its replacement, then it receives the death event making it shut down.
> The race condition allows these two events to be received in the wrong order
> so the EventThread shuts down before a replacement can be set up.
>
> The "Curator-ConnectionStateManager" thread calls injectSessionExpiration
> which calls EventThread#queueEvent. It goes as far as my comment in the next
> code snippit, sets the status and let's say it pauses here, for whatever
> reason.
>
> {code:java}
> private void queueEvent(WatchedEvent event,
> Set<Watcher> materializedWatchers) {
> if (event.getType() == EventType.None
> && sessionState == event.getState()) {
> return;
> }
> sessionState = event.getState(); // The curator thread sets the status
> here
> // It then goes onto send the expiry event, but for now it's here..the
> event has not yet been sent{code}
>
> The SendThread then goes into the same method with its Expiry message, but
> the state has already been set to expired by the curator thread so it returns
> without sending the expiry message. It then sends the event of death:
> {code:java}
> // from ClientCnxn#onConnected
> eventThread.queueEvent(new WatchedEvent(
> Watcher.Event.EventType.None,
> Watcher.Event.KeeperState.Expired, null));
> eventThread.queueEventOfDeath(); {code}
>
> The EventThread receives this event of death and shuts down. The curator
> thread then continues and sends the original expiry message but it's now too
> late. With no EventThread created it remains in suspended state with any
> operations just retrying until they give up.
>
> I find it difficult to explan race conditions, but hopefully that made sense.
> I can recreate the issue in Intellij using Suspend-All break points to cause
> the session expiry and Suspend-Thread break points to order the threads in
> the correct way:
>
> Run the following and put a suspend-all break point at the System.out
> {code:java}
> package com.gebatech.curator;
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
> import org.apache.curator.retry.BoundedExponentialBackoffRetry;
> /**
> *
> */
> public class GoCurator {
> public static void main(String[] args) throws Exception {
> CuratorFramework cf = CuratorFrameworkFactory.builder()
> .connectString("localhost:2181")
> .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000,
> 20))
> .sessionTimeoutMs(30 * 1000)
> .build();
> cf.start();
> InterProcessReadWriteLock lcok = new InterProcessReadWriteLock(cf,
> "/grant/myLock");
> System.out.println("Set a Suspend-ALL break point here and leave it
> until the session expires, 30 seconds ish");
> lcok.readLock().acquire();
> System.out.println();
> }
> } {code}
> Whlst suspended here, put a Suspend-Thread break poing in
> EventThread#queueEvent, on the if statement from the first code snippet.
>
> When you see the following log line, you can let the program resume;
> {code:java}
> 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] -
> Expiring session 0x1006300f2390000, timeout of 30000ms exceeded {code}
> The SendThread will hit your break point first with a Disconnect event, you
> can let this run.
> Next you'll get the SendThread with the Expiry event, wait here until you get
> the popup telling you that the Curator thread has hit the same break point.
> Switch to this and step through until it has set the expiry status, then
> leave it held there and switch back to the SendThread. Let the sendThread
> run, then remove break points and let everything else run.
>
> The program will then be stuck and you'll see this stacktrace for the acquire
> lock method
> {code:java}
> "main@1" prio=5 tid=0x1 nid=NA waiting"main@1" prio=5 tid=0x1 nid=NA waiting
> java.lang.Thread.State: WAITING at java.lang.Object.wait(Object.java:-1)
> at java.lang.Object.wait(Object.java:502) at
> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499) at
> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487) at
> org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547) at
> org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
> at
> org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
> at
> org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) at
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
> at
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
> at
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
> at
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
> at
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
> at
> org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
> at
> org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
> at
> org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
> at
> org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
> at com.gebatech.curator.GoCurator.main(GoCurator.java:26) {code}
>
> I know this issue falls between Zookeeper and Curator but thought I'd report
> it in case you can think of a solution. I think ideally they'd send a single
> expiry event with a death flag set or, performance impact allowing,
> synchronize that method.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)