Based on your suggestion, I looked at the GC behavior of the JVM, and you
were 100% spot on. At the time amq1 got "demoted" to slave, which forced a
failover to amq2, a "stop-the-world" GC was going on.
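In case it helps anyone trying to confirm the same thing: a long stop-the-world
pause can also be spotted from inside the broker JVM with the standard
GarbageCollectorMXBean API. This is only a rough sketch (not the code I actually
used; the class name, threshold and interval are arbitrary), and it would have to
run inside the broker JVM, e.g. started from a broker plugin or a small agent:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Sketch: periodically sample the accumulated GC time of all collectors
    // and print a warning when a lot of time was spent in GC since the last sample.
    public class GcPauseWatcher implements Runnable {

        private static final long SAMPLE_INTERVAL_MS = 5000; // arbitrary
        private static final long WARN_THRESHOLD_MS = 1000;  // arbitrary

        @Override
        public void run() {
            long lastTotalGcTime = 0;
            while (!Thread.currentThread().isInterrupted()) {
                long totalGcTime = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    totalGcTime += gc.getCollectionTime(); // accumulated GC time in ms
                }
                long spentInGc = totalGcTime - lastTotalGcTime;
                if (spentInGc > WARN_THRESHOLD_MS) {
                    System.out.println("Spent " + spentInGc + " ms in GC during the last "
                            + SAMPLE_INTERVAL_MS + " ms interval");
                }
                lastTotalGcTime = totalGcTime;
                try {
                    Thread.sleep(SAMPLE_INTERVAL_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }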

Also, I was able to make the failover work correctly with the second cluster
in the network.
In my first cluster, consisting of amq1-3, the <networkConnectors> section was
identical on every node. Each defined connector has a different name as suggested
(I have defined 5 connectors to improve throughput by using 5 concurrent
connections). However, after the failover the "other side" (amq4) was
"complaining" that a connection with that name already exists from amq1 and
hence it rejected the connection from amq2.
It looks like, in the case of such a failover, the connections from amq1 to amq4
don't get cleaned up.

The work-around and solution was to give *every* connector on each
cluster node (amq1, amq2, amq3) a unique name.
So the <networkConnectors> section on amq1 looks like this:
        <networkConnectors>
            <networkConnector name="link1a" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link2a" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link3a" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link4a" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link5a" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
        </networkConnectors>

and the same section on amq2 looks like this:
        <networkConnectors>
            <networkConnector name="link1b" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link2b" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link3b" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link4b" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
            <networkConnector name="link5b" duplex="true"
                conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
                uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
        </networkConnectors>

(notice the different names, e.g. "link1b" vs "link1a")

And similarly on amq3 (which I will spare you here).

I have now run several tests and it looks like the failover is happening
correctly with no messages getting lost. I had, however, a few cases where
messages got delivered twice. For example, I sent 100,000 messages from my
producer and the consumer actually received 100,043 messages. Although not
ideal, since I will always have to do duplicate checking, it is better than
losing messages.
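
A consumer-side duplicate check can be as simple as remembering the
JMSMessageIDs that have already been processed. The following is only a sketch
(not my actual code; the class name and the size limit are made up for
illustration, and it assumes a single consumer thread per listener):

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;

    // Wraps the real listener and drops messages whose JMSMessageID was already seen.
    // The bounded LinkedHashMap keeps memory usage constant for long-running consumers.
    public class DeduplicatingListener implements MessageListener {

        private static final int MAX_TRACKED_IDS = 100000; // arbitrary

        private final MessageListener delegate;
        private final Set<String> seenIds = Collections.newSetFromMap(
                new LinkedHashMap<String, Boolean>() {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > MAX_TRACKED_IDS;
                    }
                });

        public DeduplicatingListener(MessageListener delegate) {
            this.delegate = delegate;
        }

        @Override
        public void onMessage(Message message) {
            try {
                String id = message.getJMSMessageID();
                if (id != null && !seenIds.add(id)) {
                    return; // duplicate delivery after failover, skip it
                }
            } catch (JMSException e) {
                // if the ID cannot be read, fall through and process the message anyway
            }
            delegate.onMessage(message);
        }
    }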

One additional note: when the failover happens, the "other" active cluster
node in the network (e.g. amq4) quite often dumps all the messages it
received but could not acknowledge to amq1 into its log. This is not really
good behavior, since nobody is going to scan through hundreds of lines in
the log file to identify those messages. It would be better to set up another
DLQ for that and "dump" the messages there rather than into the log file.

I will run some more tests after changing the GC to G1, hopefully avoiding a
full GC and thus the demotion of the broker to slave that forces a failover.


