Shawyeok opened a new issue, #18269:
URL: https://github.com/apache/pulsar/issues/18269

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Version
   
   ```shell
   # pulsar version: 2.8.1
   
   # cat /etc/redhat-release
   CentOS Linux release 7.9.2009 (Core)
   # java -version
   java version "1.8.0_121"
   Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
   Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
   # uname  -r
   3.10.0-123.el7.x86_64
   ```
   
   ### Minimal reproduce step
   
   Setup a pulsar cluster and zk cluster, could get a chance to reproduce (not 
always) after stop zk leader
   
   ### What did you expect to see?
   
   zk client should reconnect another zk node properly, 
   
   ### What did you see instead?
   
   `main-EventThread` is blocking on 
`org.apache.pulsar.metadata.impl.ZKSessionWatcher#process` method.
   ```java
   "metadata-store-zk-session-watcher-7-1" #16 prio=5 os_prio=0 
tid=0x00007fd7ca112000 nid=0x6991 waiting on condition [0x00007fd742afa000]
      java.lang.Thread.State: TIMED_WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x00000007953debd8> (a 
java.util.concurrent.CompletableFuture$Signaller)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
           at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
           at 
org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104)
           - locked <0x00000006c66130e0> (a 
org.apache.pulsar.metadata.impl.ZKSessionWatcher)
           at 
org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$38/909132503.run(Unknown
 Source)
           at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           at java.lang.Thread.run(Thread.java:748)
   
   "main-EventThread" #15 daemon prio=5 os_prio=0 tid=0x00007fd7ca108800 
nid=0x6990 waiting for monitor entry [0x00007fd768380000]
      java.lang.Thread.State: BLOCKED (on object monitor)
           at 
org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120)
           - waiting to lock <0x00000006c66130e0> (a 
org.apache.pulsar.metadata.impl.ZKSessionWatcher)
           at 
org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:75)
           at 
org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$35/1305486145.process(Unknown
 Source)
           at 
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180)
           at 
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146)
           at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:588)
           at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)
   ```
   
   `main-EventThread` log:
   <img width="1325" alt="image" 
src="https://user-images.githubusercontent.com/5058708/198993764-10cb9a13-db90-43b8-a709-d8d9d3e6687d.png";>
   
   `metadata-store-zk-session-watcher-7-1` log:
   <img width="1323" alt="image" 
src="https://user-images.githubusercontent.com/5058708/198994044-4bb721d4-58f9-452e-abb6-6a1b524e0a25.png";>
   
   `org.apache.pulsar.metadata.impl.ZKSessionWatcher#currentStatus` is 
`SessionLost`
   <img width="836" alt="image" 
src="https://user-images.githubusercontent.com/5058708/198994498-80b3f49e-a615-40b4-b14d-273a56cef366.png";>
   
   zk EventThread internal task queue `waitingEvents` has many events to process
   <img width="1307" alt="image" 
src="https://user-images.githubusercontent.com/5058708/198994634-1c6d700b-0aa5-42bf-8990-cb37c892239d.png";>
   
   The keeperState of current event that zk EventThread processing is 
`SyncConnected`
   <img width="1388" alt="image" 
src="https://user-images.githubusercontent.com/5058708/198995351-8bf5ad06-cea2-46fc-9065-ec8068e983fb.png";>
   
   It seems that zk exists operation will always timeout cause zk EventThread 
is blocking and `metadata-store-zk-session-watcher` thread can always acquire 
lock (I don't understand why, a jvm bug?).
   
https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L68-L71
   
https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L86-L108
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to