Re: Potential Bug in Master-Slave with Replicated LevelDB Store
I spent time last week trying to tune the parallel GC to prevent any objects from reaching OldGen once the broker was up and running in a steady state, to avoid expensive full GCs. My goal was zero full GCs for a broker with 3-6 months of uptime, to prevent clients and other brokers from failing over from one broker to another. I increased the size of NewGen relative to OldGen and the size of Survivor relative to Eden, and I tweaked a few other settings, but I was never able to avoid a slow stream of objects making it into OldGen that were already dead by the time a full GC happened (usually because I triggered it manually). I was able to reduce the rate of object promotion by about half, and full GCs would probably have been less painful with OldGen at only 5-10% of the total heap, so the changes should have made full GCs less frequent and less painful, but I wasn't able to eliminate them entirely.

So I've given up on the parallel GC and am now tweaking G1 to make it behave as we'd like, and so far the results are far more promising than with the parallel GC. So I second Ulrich's recommendation to use G1 rather than the parallel GC, even though the overhead of G1 is several times that of the parallel GC, if you're more interested in avoiding occasional lengthy pauses due to full GCs than in getting the highest possible throughput from your broker.

On Tue, Oct 21, 2014 at 10:13 AM, Tim Bain tb...@alumni.duke.edu wrote:

G1GC is great for reducing the duration of any single stop-the-world GC (and hence minimizing the latency of any individual operation, as well as avoiding timeouts), but the total time spent performing GCs (and hence the total amount of time the brokers are paused) is several times that of the parallel GC algorithm, based on some articles I read a couple of weeks back.
So although G1GC should work for a wide range (possibly all) of ActiveMQ memory usage patterns and may be the right option for you based on how your broker is used, you may get better overall throughput from sticking with ParallelGC but adjusting the ratio of YoungGen to OldGen to favor YoungGen (increasing the odds that a message gets GC'ed before it reaches OldGen) and the ratio of Eden to Survivor within YoungGen to favor Survivor (increasing the odds that a message can stick around in YoungGen long enough to die before it gets promoted to OldGen). But you have to be confident that your usage patterns won't allow OldGen to fill during your broker's uptime (whether that's hours or years); otherwise you'll end up doing a long full GC, and you'd probably have been better off going with G1GC.

For our broker, we expire undelivered messages quickly (under a minute), so in theory expanding both YoungGen and Survivor might prevent anything from getting into OldGen and thus prevent long full GCs. I'm actually going to be doing this tuning this week, so I'll report out what I find, though obviously YMMV since everyone's message usage patterns are different.

On Tue, Oct 21, 2014 at 5:25 AM, uromahn ulr...@ulrichromahn.net wrote:

Another update: I ran the broker with the pure-Java LevelDB and found that I am still seeing the warnings in the log file as reported before. However, to my surprise the broker seems to perform better and even slightly faster! I always thought the native LevelDB should be faster, but I guess the access via JNI may be less optimal than using an embedded Java (or Scala) engine.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686583.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
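[Editor's note] The YoungGen/Survivor resizing described above maps onto standard HotSpot sizing flags. The following is only an illustrative sketch: the flag names are standard HotSpot options, but the specific values are examples, not the settings used in the tests above.

```sh
# Illustrative ParallelGC sizing added to ACTIVEMQ_OPTS (values are examples).
# NewRatio=1 gives YoungGen half the heap (favoring YoungGen over OldGen);
# SurvivorRatio=4 enlarges the survivor spaces relative to Eden;
# MaxTenuringThreshold=15 keeps objects in YoungGen as long as possible
# before they are promoted to OldGen.
ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS -XX:+UseParallelGC -XX:+UseParallelOldGC \
  -XX:NewRatio=1 -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=15"
```

Whether these ratios help depends entirely on message lifetimes, as the discussion above notes; they only pay off if messages die before surviving enough young collections to be promoted.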
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Based on your suggestion, I looked at the GC behavior of the JVM, and you were 100% spot on: at the time amq1 got demoted to slave, forcing a failover to amq2, a stop-the-world GC was going on.

Also, I was able to make the failover work correctly with the second cluster in the network. In my first cluster, consisting of amq1-3, I had identical networkConnectors sections. Each defined connector has a different name as suggested (I defined 5 connectors to improve throughput by using 5 concurrent connections). However, after the failover the other side (amq4) complained that a connection already existed from amq1 and hence rejected the connection from amq2. It looks like, in the case of such a failover, the connections from amq1 to amq4 don't get cleaned up.

The work-around and solution was to give *every* connection from each cluster node (amq1, amq2, amq3) a unique name. So the networkConnectors section on amq1 looks like this:

<networkConnectors>
  <networkConnector name="link1a" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2a" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3a" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4a" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5a" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

and the same section on amq2 like this:

<networkConnectors>
  <networkConnector name="link1b" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2b" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3b" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4b" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5b" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

(notice the different names, e.g. link1b vs. link1a) And similarly on amq3 (which I spare you here).

I ran several tests now and it looks like the failover is happening correctly, with no messages getting lost. I had, however, a few cases where messages got delivered twice. For example, I sent 100,000 messages from my producer and the consumer actually received 100,043 messages. Although not ideal, since I will always have to do duplicate checking, it is better than losing messages.

One additional note: when the failover happens, the other active cluster node in the network (e.g. amq4) quite often dumps all the messages it received but could not acknowledge to amq1 into the log. This is not really good behavior, since nobody would really scan through hundreds of lines in a log file to identify those messages.
It would be better to set up another DLQ for that and dump the messages there rather than into the log file. I will run some more tests after changing the GC to G1, hopefully avoiding the full GCs and the demotion of the broker to slave that forces a failover.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686576.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
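[Editor's note] Since duplicate deliveries after a failover seem unavoidable in this setup, the duplicate checking mentioned above can be done on the consumer side by tracking JMS message IDs. This is a minimal plain-Java sketch (not ActiveMQ API; the class and method names are hypothetical, and a production version would need to bound or persist the seen-ID set):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical consumer-side duplicate suppression keyed on the JMS message ID
// (each redelivered message carries the same JMSMessageID as the original).
class DedupTracker {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a message ID is seen, false on redelivery.
    boolean firstDelivery(String jmsMessageId) {
        return seen.add(jmsMessageId); // Set.add is false if already present
    }

    public static void main(String[] args) {
        DedupTracker t = new DedupTracker();
        System.out.println(t.firstDelivery("ID:amq1-1")); // true
        System.out.println(t.firstDelivery("ID:amq1-1")); // false: duplicate
    }
}
```

In a real consumer, a message whose ID fails this check would simply be acknowledged and discarded instead of processed.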
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Quick update: I have enabled G1GC for the JVM running the broker and since then have had no problems. The master broker stays master even under very heavy load. So my suggestion and recommendation when using replicated LevelDB is to use the G1 garbage collector, which significantly reduces the stop-the-world GC pauses that cause a timeout on the connection to ZooKeeper and ultimately demote a broker to slave for no real reason.

However, while running my tests, I noticed the following warnings in the log files of both slaves:

2014-10-21 09:44:37,694 | WARN | Invalid log position: 1569963680 | org.apache.activemq.leveldb.LevelDBClient | Thread-2

There are probably hundreds of those (I didn't actually count them), with the log positions obviously changing. I will retry my tests by switching to the Java implementation of LevelDB.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686580.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
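[Editor's note] Besides switching to G1, the ZooKeeper session timeout on the replicated store can be raised so that a GC pause shorter than the timeout no longer demotes the master. This is a hedged sketch of the persistenceAdapter configuration: the directory, hostnames, and timeout value are illustrative, and the attribute name should be checked against your release's schema (some 5.x versions spell it zkSessionTmeout due to a typo in the code).

```xml
<persistenceAdapter>
  <replicatedLevelDB
      directory="${activemq.data}/leveldb"
      replicas="3"
      bind="tcp://0.0.0.0:61619"
      zkAddress="uromahn-zk1:2181,uromahn-zk2:2181,uromahn-zk3:2181"
      zkSessionTimeout="10s"
      hostname="uromahn-amq1"/>
</persistenceAdapter>
```

The trade-off is that a genuinely dead master also takes that much longer to be detected, so the timeout should stay comfortably above worst-case GC pauses but no higher.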
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Another update: I ran the broker with the pure-Java LevelDB and found that I am still seeing the warnings in the log file as reported before. However, to my surprise the broker seems to perform better and even slightly faster! I always thought the native LevelDB should be faster, but I guess the access via JNI may be less optimal than using an embedded Java (or Scala) engine.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686583.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
G1GC is great for reducing the duration of any single stop-the-world GC (and hence minimizing the latency of any individual operation, as well as avoiding timeouts), but the total time spent performing GCs (and hence the total amount of time the brokers are paused) is several times that of the parallel GC algorithm, based on some articles I read a couple of weeks back.

So although G1GC should work for a wide range (possibly all) of ActiveMQ memory usage patterns and may be the right option for you based on how your broker is used, you may get better overall throughput from sticking with ParallelGC but adjusting the ratio of YoungGen to OldGen to favor YoungGen (increasing the odds that a message gets GC'ed before it reaches OldGen) and the ratio of Eden to Survivor within YoungGen to favor Survivor (increasing the odds that a message can stick around in YoungGen long enough to die before it gets promoted to OldGen). But you have to be confident that your usage patterns won't allow OldGen to fill during your broker's uptime (whether that's hours or years); otherwise you'll end up doing a long full GC, and you'd probably have been better off going with G1GC.

For our broker, we expire undelivered messages quickly (under a minute), so in theory expanding both YoungGen and Survivor might prevent anything from getting into OldGen and thus prevent long full GCs. I'm actually going to be doing this tuning this week, so I'll report out what I find, though obviously YMMV since everyone's message usage patterns are different.

On Tue, Oct 21, 2014 at 5:25 AM, uromahn ulr...@ulrichromahn.net wrote:

Another update: I ran the broker with the pure-Java LevelDB and found that I am still seeing the warnings in the log file as reported before. However, to my surprise the broker seems to perform better and even slightly faster! I always thought the native LevelDB should be faster, but I guess the access via JNI may be less optimal than using an embedded Java (or Scala) engine.
-- View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686583.html Sent from the ActiveMQ - User mailing list archive at Nabble.com.
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Ok, looks like the issue is back again. The network issues have been fixed. It is *not* a slow network - pings between VMs are less than 1 ms. I have not investigated the throughput difference but wanted to focus on the reliability of the replicated message store. I made some configuration changes to the network connectors: I defined five connectors per node (amq1-3). Here is what I observed:

* When I launch one producer connecting to amq1 and one consumer connecting to amq4 and send 100,000 messages, everything works fine.
* When I launch five producers connecting to amq1 and five consumers connecting to amq4 and send 100,000 messages, still fine.
* When I launch 10 producers connecting to amq1 and 10 consumers connecting to amq4 and send 100,000 messages, I see the following:

1. The number of pending messages in the queue on amq1 is slowly but steadily increasing; the consumer on amq4 is still reading messages.
2. After about 70,000 to 80,000 messages, amq1 suddenly stops working and amq2 gets promoted to master. amq4 is still reading messages.
3. From that time on, the log of amq4 fills up with the following exceptions:

2014-10-20 13:11:43,227 | ERROR | Exception: org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: null on duplex forward of: ActiveMQTextMessage ...
(dump of message comes here)

Here is an excerpt of the log from amq1 at the time it got demoted to slave:

2014-10-20 12:56:44,007 | INFO | Slave has now caught up: 2607dbe5-e42a-44bf-8f90-6edf8caa8d87 | org.apache.activemq.leveldb.replicated.MasterLevelDBStore | hawtdispatch-DEFAULT-1
2014-10-20 13:11:42,535 | INFO | Client session timed out, have not heard from server in 2763ms for sessionid 0x2492d8210c30003, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | main-SendThread(uromahn-zk2-9775:2181)
2014-10-20 13:11:42,639 | INFO | Demoted to slave | org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state change dispatcher thread

(NOTE: 12:56 was the time the broker cluster was started. Between then and 13:11, I was running the various tests.)

After that I see a ton of exceptions and error messages saying that the replicated store has stopped, and similar. After some time, it looks like broker amq1 has re-stabilized itself and reports having been started as slave. I don't know exactly what is going on, but it appears that something is wrong with the replicated LevelDB; this needs more investigation.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686548.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Take a look at whether the JVM is doing a full garbage collection at the time the failover occurs. Our team has observed clients failing over to an alternate broker at a time that corresponded to a full GC, and it might be that the same thing is happening here (but the failover isn't happening gracefully). If that's what's going on, you should be able to work around the problem by tuning your JVM heap and/or your GC strategy, though it still sounds like there's a bug related to the failover that should be tracked down and fixed as well.

On Mon, Oct 20, 2014 at 7:36 AM, uromahn ulr...@ulrichromahn.net wrote:

Ok, looks like the issue is back again. The network issues have been fixed. It is *not* a slow network - pings between VMs are less than 1 ms. I have not investigated the throughput difference but wanted to focus on the reliability of the replicated message store. I made some configuration changes to the network connectors: I defined five connectors per node (amq1-3). Here is what I observed:

* When I launch one producer connecting to amq1 and one consumer connecting to amq4 and send 100,000 messages, everything works fine.
* When I launch five producers connecting to amq1 and five consumers connecting to amq4 and send 100,000 messages, still fine.
* When I launch 10 producers connecting to amq1 and 10 consumers connecting to amq4 and send 100,000 messages, I see the following:

1. The number of pending messages in the queue on amq1 is slowly but steadily increasing; the consumer on amq4 is still reading messages.
2. After about 70,000 to 80,000 messages, amq1 suddenly stops working and amq2 gets promoted to master. amq4 is still reading messages.
3. From that time on, the log of amq4 fills up with the following exceptions:

2014-10-20 13:11:43,227 | ERROR | Exception: org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: null on duplex forward of: ActiveMQTextMessage ...
(dump of message comes here)

Here is an excerpt of the log from amq1 at the time it got demoted to slave:

2014-10-20 12:56:44,007 | INFO | Slave has now caught up: 2607dbe5-e42a-44bf-8f90-6edf8caa8d87 | org.apache.activemq.leveldb.replicated.MasterLevelDBStore | hawtdispatch-DEFAULT-1
2014-10-20 13:11:42,535 | INFO | Client session timed out, have not heard from server in 2763ms for sessionid 0x2492d8210c30003, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | main-SendThread(uromahn-zk2-9775:2181)
2014-10-20 13:11:42,639 | INFO | Demoted to slave | org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state change dispatcher thread

(NOTE: 12:56 was the time the broker cluster was started. Between then and 13:11, I was running the various tests.)

After that I see a ton of exceptions and error messages saying that the replicated store has stopped, and similar. After some time, it looks like broker amq1 has re-stabilized itself and reports having been started as slave. I don't know exactly what is going on, but it appears that something is wrong with the replicated LevelDB; this needs more investigation.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686548.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
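[Editor's note] The correlation suggested above (full GC coinciding with the demotion) can be confirmed by enabling GC logging on the broker JVM and matching pause timestamps against the ZooKeeper session-loss entries. These are the standard HotSpot GC-logging flags for the Java 7/8 JVMs current at the time; the log path is illustrative.

```sh
# Standard HotSpot GC-logging flags (Java 7/8); the log path is an example.
# PrintGCApplicationStoppedTime records total stop-the-world pause time,
# which can be compared against the "Client session timed out" timestamps.
ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS -verbose:gc -Xloggc:/var/log/activemq/gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
```

A full GC whose pause exceeds the ZooKeeper session timeout (2763 ms in the log above) at 13:11:42 would confirm the diagnosis.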
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
tbain98 wrote:
In your broker-to-broker networkConnectors, are you using maxReconnects=0 as an argument to the failover URI? It wouldn't explain why amq4 got demoted, but it could explain why messages aren't transferring to amq5 instead.

Here is the definition of my networkConnectors inside activemq.xml:

<networkConnectors>
  <networkConnector name="link1" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2" duplex="true" conduitSubscriptions="false" decreaseNetworkConsumerPriority="false" uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

As you can see, I have used the masterslave transport instead of failover and have not specified any additional configuration parameters. I will try, however, to change this to failover for another test.

tbain98 wrote:
You say you've got duplex connections between the clusters; which cluster is the one that establishes them via a networkConnector? And do you see the same behavior if you put producers on cluster2 and consumers on cluster1?

I have not tried that, but will do this first, even before I change masterslave to failover.

tbain98 wrote:
Also, looking at your logs it's not clear what happens between 13:00:48 (when amq5 becomes the master) and 13:32:20 (30 minutes later, when the LevelDB exception occurs). Are messages transferring successfully to amq5, or is it sitting idle?

Oops, that was a copy-paste error. The message at 13:32:20 was actually caused by me shutting down all my brokers. Sorry for the confusion. And to answer your question: there were absolutely no messages transferred from amq1 to amq4, amq5, or amq6. This was visible on the Network page in the web console of amq1, which shows the number of messages enqueued and dequeued.

I will post back to this thread once I have run the other suggested tests.
(removed my original post for brevity) -- View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686488.html Sent from the ActiveMQ - User mailing list archive at Nabble.com.
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
Quick update... I re-ran my tests as suggested. First my producer connected to amq4 and the consumer to amq1. That setup worked quite well, without any error or timeout. Then I re-configured it again with messages being sent to amq1 and consumed from amq4. To my surprise it worked this time (I had re-run the test three times yesterday and all three runs failed the same way!). However, I noticed that when transmitting messages from amq1 to amq4 the bridge appears to be slower than in the other direction, since I saw on average 10k pending messages in all queues; the other way around, there were on average fewer than 1,500 messages pending.

On another note: I mentioned that I set up this environment in our private cloud infrastructure. This morning I saw a note from our infrastructure guys that we are having some network issues in the data center hosting this environment. It is certainly possible that my issues yesterday were a side-effect of the network problems. So I will keep testing, but for now my suspicion is that it is likely a network problem and not an issue within ActiveMQ. I will follow up in case the issue shows up again. Until then, sorry for potentially raising a false alarm.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686492.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
masterslave: is an alias for failover: with maxReconnects=0. (There might be another URI option included in the alias, I don't remember; I think the details are in the JIRA where Gary added the failover transport, if you're curious.) So there's no need to try using failover explicitly, since the configuration you're using already includes the URI option I was concerned about.

It's strange that you're seeing different throughput (based on different numbers of pending messages) depending on which direction messages flow between the clusters. It might be due to the network issues you referenced; if not, then hopefully you can figure out which link is the slow one by finding the last point where messages are piling up. Is there a non-trivial amount of latency (more than 15 ms, let's say) on any of the links between brokers, or on the links between clients and brokers? I've had to do quite a bit of tuning to get ActiveMQ to run efficiently across a high-latency WAN, so if you have a bad network link in your setup you may need to make some adjustments to improve throughput.

Also, just to confirm: were you comparing pending queue sizes based on which role (producer-side or consumer-side) each cluster played in your test? (That is, comparing amq1-3 in your first setup with amq4-6 in your second setup, and vice versa.) Make sure your comparisons were apples to apples between the tests; otherwise the conclusion of lower throughput might not be valid.

On Fri, Oct 17, 2014 at 4:20 AM, uromahn ulr...@ulrichromahn.net wrote:

Quick update... I re-ran my tests as suggested. First my producer connected to amq4 and the consumer to amq1. That setup worked quite well, without any error or timeout. Then I re-configured it again with messages being sent to amq1 and consumed from amq4. To my surprise it worked this time (I had re-run the test three times yesterday and all three runs failed the same way!).
However, I noticed that when transmitting messages from amq1 to amq4 the bridge appears to be slower than in the other direction, since I saw on average 10k pending messages in all queues; the other way around, there were on average fewer than 1,500 messages pending.

On another note: I mentioned that I set up this environment in our private cloud infrastructure. This morning I saw a note from our infrastructure guys that we are having some network issues in the data center hosting this environment. It is certainly possible that my issues yesterday were a side-effect of the network problems. So I will keep testing, but for now my suspicion is that it is likely a network problem and not an issue within ActiveMQ. I will follow up in case the issue shows up again. Until then, sorry for potentially raising a false alarm.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686492.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
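[Editor's note] The masterslave:-to-failover: equivalence described above can be written out explicitly. This is a sketch of what the alias is believed to expand to, based on the description in this thread; the exact option set folded into the alias should be verified against the ActiveMQ documentation or source for your version.

```xml
<!-- Sketch: explicit equivalent of masterslave:(...) per the explanation above.
     maxReconnects=0 stops the bridge from reconnecting to a demoted master,
     so the failover list alone decides where the bridge lands. -->
<networkConnector name="link1" duplex="true" conduitSubscriptions="false"
    decreaseNetworkConsumerPriority="false"
    uri="static:(failover:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)?maxReconnects=0)"/>
```

Since the alias already carries maxReconnects=0, switching the configuration from masterslave: to an explicit failover: URI would not be expected to change behavior.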
Re: Potential Bug in Master-Slave with Replicated LevelDB Store
In your broker-to-broker networkConnectors, are you using maxReconnects=0 as an argument to the failover URI? It wouldn't explain why amq4 got demoted, but it could explain why messages aren't transferring to amq5 instead.

You say you've got duplex connections between the clusters; which cluster is the one that establishes them via a networkConnector? And do you see the same behavior if you put producers on cluster2 and consumers on cluster1?

Also, looking at your logs it's not clear what happens between 13:00:48 (when amq5 becomes the master) and 13:32:20 (30 minutes later, when the LevelDB exception occurs). Are messages transferring successfully to amq5, or is it sitting idle?

On Thu, Oct 16, 2014 at 8:42 AM, uromahn ulr...@ulrichromahn.net wrote:

I believe I may have found a bug here. However, this could also be a mis-configuration on my side. Before I go into a detailed description of my observation, here is my setup:

I have set up the following system in a virtual environment. I have 3 ZooKeeper nodes. I have 6 ActiveMQ brokers running 5.10.0. All nodes (ZK, AMQ) run on CentOS 6.5 64-bit with the latest OpenJDK. Three brokers form an active/passive cluster using the replicated LevelDB store. I have installed native LevelDB 1.7.0, accessing it via the JNI driver. The two clusters form a network of brokers. The networkConnectors are defined in the activemq.xml files of only one cluster, as duplex connections.

Here is my test case and the situation: Let's name the six brokers amq1 - amq6. The first active/passive cluster is amq1, amq2, and amq3, with amq1 active and the other two passive. The second cluster consists of amq4, amq5, and amq6, with amq4 as the active one. I have one producer connecting to amq1, publishing messages to a virtual topic VirtualTopic.Test, and a consumer connecting to amq4, reading those messages from the corresponding queue Consumer.A.VirtualTopic.Test.
In my test, I am sending 100,000 text messages with a body consisting of 1024 random characters to the virtual topic at maximum speed. However, after about 25,000 to 27,000 messages, the consumer on amq4 times out after about 10 seconds of not receiving any more messages, although the producer has already sent all 100,000 messages to amq1. Looking at the log file of amq4, I see the following messages:

2014-10-16 12:53:12,556 | INFO | Started responder end of duplex bridge link2@ID:uromahn-amq1-9110-48269-1413463991773-0:1 | org.apache.activemq.broker.TransportConnection | ActiveMQ Transport: tcp:///xx.xx.xx.xx:58959@61616
2014-10-16 12:53:12,559 | INFO | Started responder end of duplex bridge link1@ID:uromahn-amq1-9110-48269-1413463991773-0:1 | org.apache.activemq.broker.TransportConnection | ActiveMQ Transport: tcp:///xx.xx.xx.xx:58958@61616
2014-10-16 12:53:12,591 | INFO | Network connection between vm://brokergrp2#2 and tcp:///xx.xx.xx.xx:58958@61616 (brokergrp1) has been established. | org.apache.activemq.network.DemandForwardingBridgeSupport | triggerStartAsyncNetworkBridgeCreation: remoteBroker=tcp:///xx.xx.xx.xx:58958@61616, localBroker= vm://brokergrp2#2
2014-10-16 12:53:12,591 | INFO | Network connection between vm://brokergrp2#0 and tcp:///10.64.253.198:58959@61616 (brokergrp1) has been established. | org.apache.activemq.network.DemandForwardingBridgeSupport | triggerStartAsyncNetworkBridgeCreation: remoteBroker=tcp:///xx.xx.xx.xx:58959@61616, localBroker= vm://brokergrp2#0
2014-10-16 13:00:10,470 | INFO | Client session timed out, have not heard from server in 4071ms for sessionid 0x14918fd40f7, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | main-SendThread(uromahn-zk1-9208:2181)
2014-10-16 13:00:10,575 | INFO | Demoted to slave | org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state change dispatcher thread
2014-10-16 13:00:10,582 | INFO | Apache ActiveMQ 5.10.0 (brokergrp2, ID:uromahn-amq4-9175-46383-1413463934439-0:1) is shutting down | org.apache.activemq.broker.BrokerService | ActiveMQ BrokerService[brokergrp2] Task-8
2014-10-16 13:00:10,594 | WARN | Transport Connection to: tcp://zz.zz.zz.zz:34737 failed: java.io.IOException: Unexpected error occured: org.apache.activemq.broker.BrokerStoppedException: Broker BrokerService[brokergrp2] is being stopped | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///zz.zz.zz.zz:34737@61616

After that, the log file contains a large number of dumped messages for which the broker could not send the acknowledgement back to amq1. Looking at the log file of the newly promoted master amq5, I see the following warning:

2014-10-16 13:00:18,422 | INFO | Network connection between vm://brokergrp2#0 and tcp:///yy.yy.yy.yy:58256@61616 (brokergrp1) has been established. | org.apache.activemq.network.DemandForwardingBridgeSupport | triggerStartAsyncNetworkBridgeCreation: