waixiaoyu opened a new issue #6252: Deadlock may be in TaskMaster when stopping
URL: https://github.com/apache/incubator-druid/issues/6252

When I tried to shut down the whole Druid cluster, I found that the Overlord process was still running. The following is part of its stack trace.

"Thread-71" #160 prio=5 os_prio=0 tid=0x0000000006bf1000 nid=0xd918f waiting on condition [0x00007f4a38b4a000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000080bed6a8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at io.druid.indexing.overlord.TaskMaster.stop(TaskMaster.java:191)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:434)
    at io.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:335)
    at io.druid.java.util.common.lifecycle.Lifecycle$1.run(Lifecycle.java:366)
    at java.lang.Thread.run(Thread.java:748)

"LeaderSelector[/druid/overlord/_OVERLORD]" #161 daemon prio=5 os_prio=0 tid=0x0000000002686800 nid=0xd1ff5 in Object.wait() [0x00007f4a39350000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at io.druid.indexing.overlord.RemoteTaskRunner.start(RemoteTaskRunner.java:327)
    - locked <0x00000000807d05b0> (a java.lang.Object)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:413)
    at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:311)
    at io.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:141)
    at io.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:91)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$10.apply(LeaderLatch.java:703)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$10.apply(LeaderLatch.java:699)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"main" #1 prio=5 os_prio=0 tid=0x0000000001b39000 nid=0xa0f02 in Object.wait() [0x00007f4a62dd4000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x000000008001b460> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1252)
    - locked <0x000000008001b460> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1326)
    at io.druid.java.util.common.lifecycle.Lifecycle.join(Lifecycle.java:377)
    at io.druid.cli.ServerRunnable.run(ServerRunnable.java:53)
    at io.druid.cli.Main.main(Main.java:116)

===================================

From the trace above, when the program is stopping, it gets stuck at TaskMaster.java:191, where it needs to acquire a ReentrantLock.
Unfortunately, this node (the Overlord process) had become the leader, so it had already acquired the lock and was itself stuck at RemoteTaskRunner.java:327. At this point the whole system is trying to stop, and no other signal, for example from ZooKeeper, can wake this thread. The program could also get stuck at RemoteTaskRunner.java:327 in some other abnormal situation. So in this case, no matter why it is stuck at RemoteTaskRunner.java:327 (that looks like another deadlock scenario, which I have hit several times before), the stop method cannot acquire the same ReentrantLock, and the program hangs here forever. Technically, I just want to stop everything at this point, so maybe the lock in the stop method is unnecessary. Otherwise, using LifecycleLock in RemoteTaskRunner.java instead of ReentrantLock looks like a better practice here.
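
To make the interaction concrete, here is a minimal, self-contained sketch of the pattern described above. It is not Druid code: the class `StopLockSketch`, its fields, and the `stop(long)` signature are all hypothetical names chosen for illustration. It assumes, as the report describes, that the leader thread takes a fair ReentrantLock and then blocks indefinitely, and it shows one possible mitigation (a bounded `tryLock` in the stop path) rather than the fix the project would necessarily choose.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the pattern in the report; not Druid's TaskMaster.
class StopLockSketch {
    // Fair lock, like the ReentrantLock$FairSync that appears in the trace.
    private final ReentrantLock giant = new ReentrantLock(true);
    // Never counted down: models a start() that waits for a signal that never arrives.
    private final CountDownLatch neverSignalled = new CountDownLatch(1);
    // Lets the demo know the leader thread already holds the lock.
    private final CountDownLatch leaderHoldsLock = new CountDownLatch(1);

    // Models becomeLeader(): take the lock, then block while still holding it.
    void becomeLeader() throws InterruptedException {
        giant.lock();
        try {
            leaderHoldsLock.countDown();
            neverSignalled.await();
        } finally {
            giant.unlock();
        }
    }

    // Models stop() with a bounded wait instead of an unconditional lock().
    // Returns false if the lock could not be acquired within the timeout.
    boolean stop(long timeoutMillis) throws InterruptedException {
        if (!giant.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            return true; // real shutdown work would go here
        } finally {
            giant.unlock();
        }
    }

    // Returns true when stop() times out while the leader is blocked and
    // succeeds once the leader thread has been interrupted.
    static boolean demo() {
        try {
            StopLockSketch sketch = new StopLockSketch();
            Thread leader = new Thread(() -> {
                try {
                    sketch.becomeLeader();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "leader-sketch");
            leader.setDaemon(true);
            leader.start();
            sketch.leaderHoldsLock.await();

            boolean stoppedWhileBlocked = sketch.stop(200);
            leader.interrupt();   // wake the blocked leader so it releases the lock
            leader.join(1000);
            boolean stoppedAfterInterrupt = sketch.stop(200);
            return !stoppedWhileBlocked && stoppedAfterInterrupt;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a plain `lock()` in `stop`, the first call in `demo` would block forever, which matches the "Thread-71" frame in the trace. Whether a timeout, an interrupt of the leader thread, or dropping the lock from stop is the right choice for Druid depends on what the lock is protecting, which this sketch does not model.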