Stephen Schaeffer created AMQ-6095:
--------------------------------------

             Summary: Deadlock in failover environment.
                 Key: AMQ-6095
                 URL: https://issues.apache.org/jira/browse/AMQ-6095
             Project: ActiveMQ
          Issue Type: Bug
    Affects Versions: 5.12.0
            Reporter: Stephen Schaeffer


Hi all, 

We have an environment as follows: 
  ActiveMQ 5.12.0 on 3 nodes, using the replicated LevelDB store with ZooKeeper
  ZooKeeper 3.4.6 on the same 3 nodes
  Java 1.8
  RHEL Server 7.1

We can start up and verify that ActiveMQ failover is working by sending and
consuming messages from different machines while taking ActiveMQ nodes up and
down (roughly as in the sketch below), and everything looks fine.
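
For reference, the check is along these lines. This is only a minimal sketch of
a JMS client over the failover transport; the host names, port, and queue name
are placeholders, not our actual configuration:

import javax.jms.{DeliveryMode, Session}
import org.apache.activemq.ActiveMQConnectionFactory

object FailoverCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder broker addresses; the failover: transport reconnects to
    // whichever node is currently master.
    val factory = new ActiveMQConnectionFactory(
      "failover:(tcp://node01:61616,tcp://node02:61616,tcp://node03:61616)")
    val connection = factory.createConnection()
    connection.start()

    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val queue   = session.createQueue("failover.test")

    // Send a persistent message, then try to consume it back.
    val producer = session.createProducer(queue)
    producer.setDeliveryMode(DeliveryMode.PERSISTENT)
    producer.send(session.createTextMessage("ping"))

    val consumer = session.createConsumer(queue)
    val received = consumer.receive(5000)   // null if nothing arrives within 5s
    println("received: " + received)

    connection.close()
  }
}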

Then, after an indeterminate amount of time, the brokers stop processing
messages, and jstack turns up this:

Found one Java-level deadlock: 
============================= 
"ActiveMQ BrokerService[activeMqBroker] Task-26": 
  waiting to lock monitor 0x00007f4520004e68 (object 0x00000000d5cbfe80, a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup),
  which is held by "ZooKeeper state change dispatcher thread"
"ZooKeeper state change dispatcher thread":
  waiting to lock monitor 0x00007f451c00ee38 (object 0x00000000d5cf1e80, a org.apache.activemq.leveldb.replicated.MasterElector),
  which is held by "ActiveMQ BrokerService[activeMqBroker] Task-25"
"ActiveMQ BrokerService[activeMqBroker] Task-25":
  waiting to lock monitor 0x00007f4520004e68 (object 0x00000000d5cbfe80, a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup),
  which is held by "ZooKeeper state change dispatcher thread"

Java stack information for the threads listed above: 
=================================================== 
"ActiveMQ BrokerService[activeMqBroker] Task-26": 
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.close(ZooKeeperGroup.scala:100)
        - waiting to lock <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
        at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore.doStop(ElectingLevelDBStore.scala:282)
        at org.apache.activemq.util.ServiceSupport.stop(ServiceSupport.java:71)
        at org.apache.activemq.util.ServiceStopper.stop(ServiceStopper.java:41)
        at org.apache.activemq.broker.BrokerService.stop(BrokerService.java:806)
        at org.apache.activemq.xbean.XBeanBrokerService.stop(XBeanBrokerService.java:122)
        at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore$$anonfun$stop_master$2.apply$mcV$sp(ElectingLevelDBStore.scala:259)
        at org.fusesource.hawtdispatch.package$$anon$4.run(hawtdispatch.scala:330)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
"ZooKeeper state change dispatcher thread":
        at org.apache.activemq.leveldb.replicated.groups.ClusteredSingletonWatcher.changed_decoded(ClusteredSingleton.scala:155)
        - waiting to lock <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
        at org.apache.activemq.leveldb.replicated.groups.ClusteredSingletonWatcher$$anon$2.changed(ClusteredSingleton.scala:108)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1$$anonfun$apply$mcV$sp$3.apply(ChangeListener.scala:89)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1$$anonfun$apply$mcV$sp$3.apply(ChangeListener.scala:88)
        at scala.collection.immutable.List.foreach(List.scala:383)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply$mcV$sp(ChangeListener.scala:88)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply(ChangeListener.scala:88)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$$anonfun$fireChanged$1.apply(ChangeListener.scala:88)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$class.check_elapsed_time(ChangeListener.scala:97)
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.check_elapsed_time(ZooKeeperGroup.scala:73)
        at org.apache.activemq.leveldb.replicated.groups.ChangeListenerSupport$class.fireChanged(ChangeListener.scala:87)
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.fireChanged(ZooKeeperGroup.scala:73)
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.org$apache$activemq$leveldb$replicated$groups$ZooKeeperGroup$$fire_cluster_change(ZooKeeperGroup.scala:182)
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup$$anon$1.onEvents(ZooKeeperGroup.scala:90)
        at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.raiseEvents(ZooKeeperTreeTracker.java:402)
        at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.track(ZooKeeperTreeTracker.java:240)
        at org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker.track(ZooKeeperTreeTracker.java:228)
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.onConnected(ZooKeeperGroup.scala:124)
        - locked <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
        at org.apache.activemq.leveldb.replicated.groups.ZKClient.callListeners(ZKClient.java:385)
        at org.apache.activemq.leveldb.replicated.groups.ZKClient$StateChangeDispatcher.run(ZKClient.java:354)
"ActiveMQ BrokerService[activeMqBroker] Task-25":
        at org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup.update(ZooKeeperGroup.scala:143)
        - waiting to lock <0x00000000d5cbfe80> (a org.apache.activemq.leveldb.replicated.groups.ZooKeeperGroup)
        at org.apache.activemq.leveldb.replicated.groups.ClusteredSingleton.join(ClusteredSingleton.scala:212)
        - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
        at org.apache.activemq.leveldb.replicated.MasterElector.update(MasterElector.scala:90)
        - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
        at org.apache.activemq.leveldb.replicated.MasterElector$change_listener$.changed(MasterElector.scala:243)
        - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
        at org.apache.activemq.leveldb.replicated.MasterElector$change_listener$$anonfun$changed$1.apply$mcV$sp(MasterElector.scala:191)
        - locked <0x00000000d5cf1e80> (a org.apache.activemq.leveldb.replicated.MasterElector)
        at org.apache.activemq.leveldb.replicated.ElectingLevelDBStore$$anonfun$stop_master$1.apply$mcV$sp(ElectingLevelDBStore.scala:252)
        at org.fusesource.hawtdispatch.package$$anon$4.run(hawtdispatch.scala:330)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock. 

For what it's worth, we are not sending a large volume of data. Also, once
node 03 was bounced, traffic resumed as normal.
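
Reading the traces, it looks like the master shutdown path (Task-25) locks the
MasterElector and then tries to lock the ZooKeeperGroup, while the ZooKeeper
state change dispatcher locks the ZooKeeperGroup (in onConnected) and then
tries to lock the MasterElector via the change listener, so the two paths take
the same pair of monitors in opposite order. A stripped-down Scala sketch of
that inversion, with plain objects standing in for the two monitors (this is
only an illustration, not the actual ActiveMQ code):

object LockOrderSketch {
  private val group   = new Object  // stands in for the ZooKeeperGroup monitor
  private val elector = new Object  // stands in for the MasterElector monitor

  private def spawn(name: String)(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body }, name)
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    // Analogue of Task-25: stop_master -> MasterElector.update -> ZooKeeperGroup.update
    val stopMaster = spawn("stop-master") {
      elector.synchronized {
        Thread.sleep(100)            // widen the race window for illustration
        group.synchronized { println("stop_master done") }
      }
    }
    // Analogue of the dispatcher: ZooKeeperGroup.onConnected -> change listener -> MasterElector
    val dispatcher = spawn("zk-dispatcher") {
      group.synchronized {
        Thread.sleep(100)
        elector.synchronized { println("dispatcher done") }
      }
    }
    // With the sleeps above, each thread ends up blocked on the monitor the
    // other one already holds: neither println is reached, and jstack shows
    // the same two-way cycle as in the report above.
    stopMaster.join()
    dispatcher.join()
  }
}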


