Based on your suggestion, I looked at the GC behavior of the JVM, and you were 100% spot on: at the moment amq1 got "demoted" to slave, forcing a failover to amq2, a "stop-the-world" GC was in progress.
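For anyone wanting to confirm the same thing, one way (an illustrative sketch, not the exact setup used here; the log path is my own assumption) is to enable GC logging on the broker JVM via ACTIVEMQ_OPTS, using HotSpot's Java 7/8-era flags:

```shell
# Hypothetical example: enable verbose GC logging for an ActiveMQ broker JVM
# so stop-the-world pauses show up with wall-clock timestamps in the log.
export ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/activemq/gc.log"
```

Correlating the timestamps in gc.log with the broker's demotion time in activemq.log is what shows whether a full GC pause lines up with the failover.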
Also, I was able to make the failover work correctly with the second cluster in the network. In my first cluster, consisting of amq1-3, the <networkConnectors> sections were identical on every node. Each defined connector had a different name, as suggested (I defined 5 connectors to improve throughput by using 5 concurrent connections). However, after the failover the "other side" (amq4) "complained" that a connection already existed from amq1 and therefore rejected the connection from amq2. It appears that in such a failover, the connections from amq1 to amq4 do not get cleaned up.

The work-around (and solution) was to give *every* connection from each cluster node (amq1, amq2, amq3) a unique name. The <networkConnectors> section on amq1 now looks like this:

<networkConnectors>
  <networkConnector name="link1a" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2a" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3a" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4a" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5a" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

and the same section on amq2 like this:

<networkConnectors>
  <networkConnector name="link1b" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2b" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3b" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4b" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5b" duplex="true" conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

(Notice the different names, e.g. "link1b" vs "link1a".) The section on amq3 is similar, which I spare you here.

I have run several tests now, and the failover happens correctly with no messages getting lost. I did, however, see a few cases where messages were delivered twice. For example, I sent 100,000 messages from my producer and the consumer actually received 100,043. Although not ideal, since I will always have to do duplicate checking, it is better than losing messages.

One additional note: when the failover happens, the "other" active cluster node in the network (e.g. amq4) quite often dumps all the messages it received but could not acknowledge to amq1 into its log. This is not really good behavior, since nobody is going to scan through hundreds of log lines to identify those messages. It would be better to set up another DLQ and "dump" the messages there rather than into the log file.
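Since a few duplicates after a failover seem unavoidable, the consumer-side duplicate check can be as simple as remembering which JMSMessageID values have already been processed. A minimal sketch of that idea (class and method names are my own; the unbounded in-memory set is an assumption, and a production version would need a bounded or persistent store):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of consumer-side duplicate detection keyed on JMSMessageID,
// so redeliveries after a broker failover can be skipped.
public class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a message ID is offered, false for duplicates.
    public boolean firstDelivery(String jmsMessageId) {
        return seen.add(jmsMessageId);
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        System.out.println(filter.firstDelivery("ID:amq1-1001"));  // true  (new message)
        System.out.println(filter.firstDelivery("ID:amq1-1001"));  // false (redelivery, skip it)
        System.out.println(filter.firstDelivery("ID:amq1-1002"));  // true  (new message)
    }
}
```

In a real consumer, firstDelivery() would be called with message.getJMSMessageID() before processing, and messages returning false would be acknowledged but otherwise ignored.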
I will run some more tests with the GC switched to G1, hopefully avoiding the full GC and thus the demotion of the broker to slave that forces a failover.

--
View this message in context: http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450p4686576.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.