Thanks for the reply! I forgot to include the thread dump that I collected. This process has been hung for almost a day, so I'm guessing it'll never connect properly ;-) We actually had 2 such processes hang today with the same stack trace (or at least the same root cause, as shown below). Please note that this problem is rare but supremely not good when it does happen if we fail to detect it. We've been running this code for many months now and this issue has only recently started occurring.

Thread 4396: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=834 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=72, line=994 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1303 (Interpreted frame)
- java.util.concurrent.CountDownLatch.await() @bci=5, line=236 (Interpreted frame)
- com.mycode.ZooKeeperFactory.connect(java.lang.String, int) @bci=34, line=59 (Interpreted frame)
...
[remainder of stack trace omitted]

John


Michael Han wrote:
Sounds like a deadlock in the client library. One idea is to instrument
your client code and dump the thread stacks when the wait times out. The
dump will hopefully contain the states of the various threads and provide
some insight into what to look for next.
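
A minimal sketch of what "dump the thread stacks" could look like from
inside the client process, using only the JDK. The class and method names
(ThreadDumps, dumpAllThreads) are made up for illustration; the idea is to
call it from the branch that handles a timed-out wait and write the result
to your log.

```java
import java.util.Map;

final class ThreadDumps {

    // Hypothetical helper: renders every live thread's name, state and stack.
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

This only shows stacks and states; it does not report which locks are held,
so a jstack dump of the hung process is still worth collecting as well.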

On Tue, Jun 20, 2017 at 3:14 PM, John Lindwall <[email protected]>
wrote:

We are seeing occasional incidents where a ZooKeeper Java client will
hang in CountDownLatch.await() while waiting for a connection to be
established.  Our connect() code is pretty standard, I think, and is
similar to this:

     private ZooKeeper connect(String hosts, int sessionTimeout)
             throws IOException, InterruptedException {
         final CountDownLatch connectedSignal = new CountDownLatch(1);

         ZooKeeper zk = new ZooKeeper(hosts, sessionTimeout, new Watcher() {
             @Override
             public void process(WatchedEvent event) {
                 if (event.getState() == Event.KeeperState.SyncConnected) {
                     connectedSignal.countDown();
                 }
             }
         });

         connectedSignal.await();
         return zk;
     }

Has anyone else had an issue with the await() blocking forever like this?
Any advice?

As a "fix" I am considering adding a timeout to the CountDownLatch await()
call; if we fail to connect within that timeout then retry the connection
attempt. After, say, 3 retries, give up entirely.
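
A rough sketch of that timeout-and-retry idea, kept free of the ZooKeeper
dependency so it compiles with the JDK alone. ConnectRetry, Attempt,
start() and abandon() are hypothetical names invented for this example,
not part of any ZooKeeper API; only CountDownLatch.await(long, TimeUnit)
is the real mechanism being shown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ConnectRetry {

    // Hypothetical hook describing one connection attempt.
    interface Attempt<T> {
        // Start connecting; count the latch down once the connection is up.
        T start(CountDownLatch connectedSignal) throws Exception;

        // Release a handle whose attempt timed out.
        void abandon(T handle) throws Exception;
    }

    static <T> T connectWithRetry(Attempt<T> attempt, int maxAttempts,
                                  long timeoutMillis) throws Exception {
        for (int i = 1; i <= maxAttempts; i++) {
            // A fresh latch per attempt, so a late event from an abandoned
            // attempt cannot release the current one.
            CountDownLatch connectedSignal = new CountDownLatch(1);
            T handle = attempt.start(connectedSignal);
            // Bounded wait: returns false if the latch was not released in time.
            if (connectedSignal.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return handle;
            }
            attempt.abandon(handle);
        }
        throw new TimeoutException(
                "not connected after " + maxAttempts + " attempts");
    }
}
```

With ZooKeeper, start() would presumably create the client with
new ZooKeeper(hosts, sessionTimeout, watcher), where the watcher counts the
latch down on SyncConnected as in the code above, and abandon() would call
zk.close() so the timed-out client does not keep threads and sockets alive.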

Thanks!
--
John Lindwall




