Stephen Schaeffer created AMQ-6095:
--------------------------------------

             Summary: Deadlock in failover environment.
                 Key: AMQ-6095
                 URL: https://issues.apache.org/jira/browse/AMQ-6095
             Project: ActiveMQ
          Issue Type: Bug
    Affects Versions: 5.12.0
            Reporter: Stephen Schaeffer
Hi all,

We have an environment as follows:

ActiveMQ 5.12.0 on 3 nodes using Zookeeper
Zookeeper 3.4.6 on the same 3 nodes
Java 1.8
RHEL Server 7.1

We can start up and verify that ActiveMQ failover is working by sending and consuming messages from different machines while taking ActiveMQ nodes up and down, and everything looks fine. Then, after some indeterminate amount of time, things stop working and jstack turns up this:

Found one Java-level deadlock:
=============================
"ActiveMQ BrokerService[activeMqBroker] Task-26":
  waiting to lock monitor 0x00007f4520004e68 (object 0x00000000d5cbfe80, a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup),
  which is held by "ZooKeeper state change dispatcher thread"
"ZooKeeper state change dispatcher thread":
  waiting to lock monitor 0x00007f451c00ee38 (object 0x00000000d5cf1e80, a org.apache.activemq.leveldb.replicated.MasterElector),
  which is held by "ActiveMQ BrokerService[activeMqBroker] Task-25"
"ActiveMQ BrokerService[activeMqBroker] Task-25":
  waiting to lock monitor 0x00007f4520004e68 (object 0x00000000d5cbfe80, a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup),
  which is held by "ZooKeeper state change dispatcher thread"

Java stack information for the threads listed above:
===================================================
"ActiveMQ BrokerService[activeMqBroker] Task-26":
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.close(ZooKeeperGroup.scala:100)
    - waiting to lock <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
    at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore.doStop(ElectingLevelDBStore.scala:282)
    at org.apache.activemq.util.ServiceSupport.stop(ServiceSupport.java:71)
    at org.apache.activemq.util.ServiceStopper.stop(ServiceStopper.java:41)
    at org.apache.activemq.broker.BrokerService.stop(BrokerService.java:806)
    at org.apache.activemq.xbean.XBeanBrokerService.stop(XBeanBrokerService.java:122)
    at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore$$anonfun$stop_master$2.apply$mcV$sp(ElectingLevelDBStore.scala:259)
    at org.fusesource.hawtdispatch.package$$anon$4.run(hawtdispatch.scala:330)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
"ZooKeeper state change dispatcher thread":
    at org.apache.activemq.leveldb.replicated.groups.ClusteredSingletonWatcher.changed_decoded(ClusteredSingleton.scala:155)
    - waiting to lock <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
    at org.apache.activemq.leveldb.replicated.groups.ClusteredSingletonWatcher$$anon$2.changed(ClusteredSingleton.scala:108)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1$$anonfun$apply$mcV$sp$3.apply(ChangeListener.scala:89)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1$$anonfun$apply$mcV$sp$3.apply(ChangeListener.scala:88)
    at scala.collection.immutable.List.foreach(List.scala:383)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply$mcV$sp(ChangeListener.scala:88)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply(ChangeListener.scala:88)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply(ChangeListener.scala:88)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$class.check_elapsed_time(ChangeListener.scala:97)
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.check_elapsed_time(ZooKeeperGroup.scala:73)
    at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$class.fireChanged(ChangeListener.scala:87)
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.fireChanged(ZooKeeperGroup.scala:73)
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.org$apache$activemq$leveldb$replicated$groups$ZooKeeperGroup$$fire_cluster_change(ZooKeeperGroup.scala:182)
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup$$anon$1.onEvents(ZooKeeperGroup.scala:90)
    at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.raiseEvents(ZooKeeperTreeTracker.java:402)
    at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.track(ZooKeeperTreeTracker.java:240)
    at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.track(ZooKeeperTreeTracker.java:228)
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.onConnected(ZooKeeperGroup.scala:124)
    - locked <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
    at org.apache.activemq.leveldb.replicated.groups.ZKClient.callListeners(ZKClient.java:385)
    at org.apache.activemq.leveldb.replicated.groups.ZKClient$StateChangeDispatcher.run(ZKClient.java:354)
"ActiveMQ BrokerService[activeMqBroker] Task-25":
    at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.update(ZooKeeperGroup.scala:143)
    - waiting to lock <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
    at org.apache.activemq.leveldb.replicated.groups.ClusteredSingleton.join(ClusteredSingleton.scala:212)
    - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
    at org.apache.activemq.leveldb.replicated.MasterElector.update(MasterElector.scala:90)
    - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
    at org.apache.activemq.leveldb.replicated.MasterElector$change_listener$.changed(MasterElector.scala:243)
    - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
    at org.apache.activemq.leveldb.replicated.MasterElector$change_listener$$anonfun$changed$1.apply$mcV$sp(MasterElector.scala:191)
    - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
    at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore$$anonfun$stop_master$1.apply$mcV$sp(ElectingLevelDBStore.scala:252)
    at org.fusesource.hawtdispatch.package$$anon$4.run(hawtdispatch.scala:330)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.

For what it's worth, we're not sending a huge amount of data around. Also, once the 03 node was bounced, traffic resumed as normal.
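The cycle above is a plain lock-order inversion: the ZooKeeper state change dispatcher holds the ZooKeeperGroup monitor (taken in onConnected) and waits for the MasterElector monitor, while Task-25 holds the MasterElector monitor (taken via MasterElector.update / ClusteredSingleton.join) and waits for the ZooKeeperGroup monitor; Task-26 is simply queued behind the same ZooKeeperGroup monitor. A minimal sketch of that shape, using two plain monitors standing in for ZooKeeperGroup and MasterElector (illustrative only, class and thread names are made up, not the broker's code):

import java.util.concurrent.CountDownLatch;

// Minimal sketch of the lock-order inversion above; the two Object monitors
// stand in for the ZooKeeperGroup and MasterElector instances.
public class LockOrderInversionSketch {

    static final Object group = new Object();    // plays the role of ZooKeeperGroup
    static final Object elector = new Object();  // plays the role of MasterElector

    public static void main(String[] args) {
        // Both threads must hold their first monitor before either tries the
        // second one, otherwise the race usually resolves without deadlocking.
        CountDownLatch bothHeld = new CountDownLatch(2);

        // Plays the "ZooKeeper state change dispatcher thread":
        // onConnected() locks the group, then the fired change needs the elector.
        Thread dispatcher = new Thread(() -> {
            synchronized (group) {
                bothHeld.countDown();
                awaitQuietly(bothHeld);
                synchronized (elector) {
                    // changed_decoded(...) would run here
                }
            }
        }, "dispatcher");

        // Plays "ActiveMQ BrokerService[activeMqBroker] Task-25":
        // MasterElector.update() locks the elector, then ClusteredSingleton.join()
        // needs the group.
        Thread task25 = new Thread(() -> {
            synchronized (elector) {
                bothHeld.countDown();
                awaitQuietly(bothHeld);
                synchronized (group) {
                    // ZooKeeperGroup.update(...) would run here
                }
            }
        }, "task-25");

        dispatcher.start();
        task25.start();
        // Both threads now block forever; jstack against this JVM reports the same
        // kind of "Found one Java-level deadlock" cycle as in the trace above.
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

As with any inversion of this kind, a fix would have to make the two paths take the group and elector monitors in one consistent order, or defer the listener callback until the group lock has been released; which of those is appropriate here is for the ActiveMQ developers to judge.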
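Since the hang only appears after an indeterminate amount of time, it may be easier to detect it programmatically than to catch it with jstack by hand. A small sketch using the standard java.lang.management ThreadMXBean API (the class name DeadlockCheck is illustrative), which reports the same monitor cycle as the summary section of the jstack output:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Reports any Java-level monitor deadlock in the running JVM.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads();   // null when no cycle exists
        if (deadlocked == null) {
            System.out.println("No deadlock detected.");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
            System.out.printf("\"%s\" is waiting for %s held by \"%s\"%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}

The same check can be run against a live broker over JMX, which avoids having to shell into the box to take thread dumps once the brokers have wedged.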