Potential Bug in Master-Slave with Replicated LevelDB Store

uromahn Thu, 16 Oct 2014 07:45:28 -0700

I believe I may have found a bug here. However, this could also be a
mis-configuration on my side.

Before I go into a detailed description of my observation, here is my setup:

I have set up the following system in a virtual environment:
I have 3 ZooKeeper nodes.
I have 6 ActiveMQ brokers running 5.10.0.
All nodes (ZK, AMQ) are running on CentOS 6.5 64-bit with the latest OpenJDK.
Each set of three brokers forms an active/passive cluster using the
replicated LevelDB store.
I have installed native LevelDB 1.7.0, accessing it via the JNI driver.
The two clusters form a network of brokers.
The "networkConnectors" are defined in the activemq.xml files in only one
cluster as duplex connections.

Here is my test case and the situation:
Let's name the six brokers amq1 - amq6. So, the first active/passive cluster
is amq1, amq2, and amq3 with amq1 active and the other two passive. The
second cluster consists of amq4, amq5, and amq6 with amq4 as the active.
I have one producer connecting to amq1 publishing messages to a VirtualTopic
"VirtualTopic.Test" and a consumer connecting to amq4 reading those messages
from a corresponding queue "Consumer.A.VirtualTopic.Test".
In my test, I am sending 100,000 text messages with a body consisting of
1024 random characters to the VirtualTopic at maximum speed.
However, after about 25,000 to 27,000 messages, the consumer on amq4 times
out after about 10 seconds without receiving any more messages, although the
producer has already sent all 100,000 messages to amq1.
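
For readers who want to reproduce the shape of this test, here is a minimal
sketch of the two pieces described above: a 1024-character random text body
and a consumer loop that gives up once no message has arrived within an idle
timeout. This is not the actual test client. The class and method names are
made up for illustration, and a plain in-memory BlockingQueue stands in for
the JMS queue "Consumer.A.VirtualTopic.Test", so no broker is involved.

```java
import java.util.Random;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test client; no JMS or ActiveMQ calls are made.
class ConsumerTimeoutSketch {

    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // Builds a text body of the requested length from random letters and digits.
    static String randomBody(int length, Random rnd) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rnd.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Counts messages until none arrives within idleTimeoutMs (10,000 in the real test).
    static int drainUntilIdle(BlockingQueue<String> queue, long idleTimeoutMs) {
        int received = 0;
        try {
            while (queue.poll(idleTimeoutMs, TimeUnit.MILLISECONDS) != null) {
                received++;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and report what was received so far.
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Random rnd = new Random();
        // Small numbers so the sketch finishes quickly; the real test sends 100,000.
        for (int i = 0; i < 1000; i++) {
            queue.add(randomBody(1024, rnd));
        }
        System.out.println("received " + drainUntilIdle(queue, 100) + " messages");
    }
}
```

With this structure, a received count below the number sent means the idle
timeout fired first, which is the symptom described above.
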
When looking at the log file of amq4, I am seeing the following messages:
....
2014-10-16 12:53:12,556 | INFO | Started responder end of duplex bridge
link2@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
tcp:///xx.xx.xx.xx:58959@61616
2014-10-16 12:53:12,559 | INFO | Started responder end of duplex bridge
link1@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
tcp:///xx.xx.xx.xx:58958@61616
2014-10-16 12:53:12,591 | INFO | Network connection between
vm://brokergrp2#2 and tcp:///xx.xx.xx.xx:58958@61616 (brokergrp1) has been
established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
triggerStartAsyncNetworkBridgeCreation:
remoteBroker=tcp:///xx.xx.xx.xx:58958@61616, localBroker= vm://brokergrp2#2
2014-10-16 12:53:12,591 | INFO | Network connection between
vm://brokergrp2#0 and tcp:///10.64.253.198:58959@61616 (brokergrp1) has been
established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
triggerStartAsyncNetworkBridgeCreation:
remoteBroker=tcp:///xx.xx.xx.xx:58959@61616, localBroker= vm://brokergrp2#0
2014-10-16 13:00:10,470 | INFO | Client session timed out, have not heard
from server in 4071ms for sessionid 0x14918fd40f00007, closing socket
connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
main-SendThread(uromahn-zk1-9208:2181)
2014-10-16 13:00:10,575 | INFO | Demoted to slave |
org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
change dispatcher thread
2014-10-16 13:00:10,582 | INFO | Apache ActiveMQ 5.10.0 (brokergrp2,
ID:uromahn-amq4-9175-46383-1413463934439-0:1) is shutting down |
org.apache.activemq.broker.BrokerService | ActiveMQ
BrokerService[brokergrp2] Task-8
2014-10-16 13:00:10,594 | WARN | Transport Connection to:
tcp://zz.zz.zz.zz:34737 failed: java.io.IOException: Unexpected error
occured: org.apache.activemq.broker.BrokerStoppedException: Broker
BrokerService[brokergrp2] is being stopped |
org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ
Transport: tcp:///zz.zz.zz.zz:34737@61616
....

After that, the log file contains a dump of a large number of messages for
which the broker could not send the acknowledgement back to amq1.
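
When scanning longer broker logs for this pattern, it can help to pull out
how long the ZooKeeper client went without hearing from the server (4071 ms
in the excerpt above). The following is only a sketch: the class and method
names are hypothetical, and it assumes each log entry sits on a single line
in the file rather than being wrapped as in this mail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning activemq.log; not part of ActiveMQ or ZooKeeper.
class ZkTimeoutScan {

    // Matches the wording of the ZooKeeper client timeout message shown in the log above.
    private static final Pattern TIMED_OUT =
            Pattern.compile("have not heard from server in (\\d+)ms");

    // Returns the reported silence in milliseconds, or -1 if the line is not a timeout line.
    static long silenceMillis(String logLine) {
        Matcher m = TIMED_OUT.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String line = "2014-10-16 13:00:10,470 | INFO | Client session timed out, "
                + "have not heard from server in 4071ms for sessionid 0x14918fd40f00007, "
                + "closing socket connection and attempting reconnect | "
                + "org.apache.zookeeper.ClientCnxn | main-SendThread(uromahn-zk1-9208:2181)";
        System.out.println(silenceMillis(line));
    }
}
```
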

Looking at the log file of the newly promoted master amq5 I see the
following warning:
....
2014-10-16 13:00:18,422 | INFO | Network connection between
vm://brokergrp2#0 and tcp:///yy.yy.yy.yy:58256@61616 (brokergrp1) has been
established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
triggerStartAsyncNetworkBridgeCreation:
remoteBroker=tcp:///yy.yy.yy.yy:58256@61616, localBroker= vm://brokergrp2#0
2014-10-16 13:32:20,285 | WARN | Unexpected session error:
java.io.IOException: Connection reset by peer |
org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
hawtdispatch-DEFAULT-2
2014-10-16 13:32:20,286 | INFO | Slave has disconnected:
e21da7b3-65a4-4756-a20b-bd77f8eb7607 |
org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
hawtdispatch-DEFAULT-2
....

After that, there are no messages being transmitted from amq1 to the second
cluster in the network (neither amq4 nor amq5). This can be seen when
looking at the corresponding tab of the web console of amq1. Also, when I
leave all brokers running, I get a message on amq4 that the broker is
waiting for the replicated store to synchronize one record (or something to
that effect; unfortunately I deleted the log already. :( ).

So, here is what appears to happen:
Somehow ZooKeeper thinks that amq4 is "dead" and then demotes it to slave in
the middle of the message transfer. This could be because amq4 is too busy
to respond to heartbeat requests. However, something goes wrong in the
LevelDB replication and the new master amq5 can't pick up. This seems to lead
to a "dead-lock" situation causing the client to time out.
NOTE: all clients are using the failover transport to connect to the
cluster. The connector between the two clusters is also defined using the
failover transport.
Another indication of something "bad" is the fact that the client reports an
exception that the connection to amq4 was reset and it is trying to
reconnect. However, that connection does not seem to happen within 10
seconds because my message listener has an internal timeout of 10 seconds.
I have double- and triple-checked the configuration and everything looks ok.
Also, this doesn't happen when both producer and consumer connect to the
same cluster, e.g. amq1 or amq4. So, the individual clusters seem to work.
The issue only seems to appear when transmitting messages between clusters in
a network of brokers.
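
For reference, the failover URIs mentioned in the NOTE above generally take
the form failover:(uri1,uri2,...). The sketch below only builds such a string
for one cluster. The host names follow the amq1 - amq3 naming used in this
mail and port 61616 is taken from the log excerpts; both are assumptions for
illustration and not copied from the actual configuration, and the class and
method names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; it only assembles a URI string and does not connect anywhere.
class FailoverUriSketch {

    // Joins the brokers of one cluster into a failover transport URI.
    static String failoverUri(List<String> hosts, int port) {
        return hosts.stream()
                .map(host -> "tcp://" + host + ":" + port)
                .collect(Collectors.joining(",", "failover:(", ")"));
    }

    public static void main(String[] args) {
        // Assumed host names and port, see the note above.
        System.out.println(failoverUri(Arrays.asList("amq1", "amq2", "amq3"), 61616));
    }
}
```

A client handed such a URI tries the listed brokers until one accepts the
connection, which in a master/slave group is expected to be the current
master.
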

Any idea if this is a bug or just a SUE (stupid user error)? :)

-Uli

--
View this message in context:
http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
