Ok, looks like the issue is back again.

The network issues have been fixed.
It is *not* a slow network - pings between VMs are less than 1ms.

I have not investigated the throughput differences yet but wanted to focus
on the reliability of the replicated message store first.

I made some configuration changes to the network connectors: I defined five
connectors per node (amq1-3).
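
(Just for context, a rough programmatic equivalent of what I mean by "five
connectors per node" is sketched below. This is only an illustrative sketch,
not my actual configuration; the broker names, URIs and the duplex flag here
are placeholders.)

import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.network.NetworkConnector;

public class NetworkConnectorSketch {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("amq1");

        // Five network bridges towards the other broker (URI is a placeholder).
        for (int i = 0; i < 5; i++) {
            NetworkConnector nc = broker.addNetworkConnector("static:(tcp://amq4:61616)");
            nc.setName("bridge-" + i);
            nc.setDuplex(true); // assumption; the exception below mentions "duplex forward"
        }

        broker.addConnector("tcp://0.0.0.0:61616");
        broker.start();
    }
}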

Here is what I observed:
* When I launch one producer connecting to amq1 and one consumer connecting
to amq4 and send 100,000 messages, everything works fine.
* When I launch five producers connecting to amq1 and five consumers
connecting to amq4 and send 100,000 messages, everything is still fine.
* When I launch 10 producers connecting to amq1 and 10 consumers connecting
to amq4 and send 100,000 messages (a rough sketch of the test clients
follows this list), I can see the following:
  1. The number of pending messages in the queue on amq1 is slowly but
steadily increasing; the consumers on amq4 are still reading messages.
  2. After about 70,000 to 80,000 messages, amq1 suddenly stops working and
amq2 gets promoted to master. amq4 is still reading messages.
  3. From that time on, the log of amq4 fills up with the following
exception: 2014-10-20 13:11:43,227 | ERROR | Exception:
org.apache.activemq.transport.InactivityIOException: Cannot send, channel
has already failed: null on duplex forward of: ActiveMQTextMessage ... <dump
of message comes here>
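
(The test clients are nothing special; a minimal JMS sketch along the lines
below reproduces the load pattern. The queue name, broker URLs and the way
the 100,000 messages are split across producers are placeholders, not my
exact harness.)

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class LoadTestSketch {
    private static final String QUEUE = "TEST.QUEUE"; // placeholder queue name

    // Producer: connects to amq1 and sends persistent text messages.
    static Runnable producer(int count) {
        return () -> {
            try {
                Connection con = new ActiveMQConnectionFactory("failover:(tcp://amq1:61616)").createConnection();
                con.start();
                Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer mp = session.createProducer(session.createQueue(QUEUE));
                mp.setDeliveryMode(DeliveryMode.PERSISTENT);
                for (int i = 0; i < count; i++) {
                    mp.send(session.createTextMessage("message-" + i));
                }
                con.close();
            } catch (JMSException e) {
                e.printStackTrace();
            }
        };
    }

    // Consumer: connects to amq4 and drains the same queue.
    static Runnable consumer(int count) {
        return () -> {
            try {
                Connection con = new ActiveMQConnectionFactory("failover:(tcp://amq4:61616)").createConnection();
                con.start();
                Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer mc = session.createConsumer(session.createQueue(QUEUE));
                for (int i = 0; i < count; i++) {
                    mc.receive();
                }
                con.close();
            } catch (JMSException e) {
                e.printStackTrace();
            }
        };
    }

    public static void main(String[] args) {
        // 10 producers and 10 consumers, 10,000 messages each = 100,000 in total.
        for (int i = 0; i < 10; i++) {
            new Thread(producer(10000)).start();
            new Thread(consumer(10000)).start();
        }
    }
}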

Here is an excerpt of the log from amq1 at the time it got "demoted" to
slave:
2014-10-20 12:56:44,007 | INFO  | Slave has now caught up:
2607dbe5-e42a-44bf-8f90-6edf8caa8d87 |
org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
hawtdispatch-DEFAULT-1
2014-10-20 13:11:42,535 | INFO  | Client session timed out, have not heard
from server in 2763ms for sessionid 0x2492d8210c30003, closing socket
connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
main-SendThread(uromahn-zk2-9775:2181)
2014-10-20 13:11:42,639 | INFO  | Demoted to slave |
org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
change dispatcher thread

(NOTE: 12:56 was the time the broker cluster was started. Between that time
and 13:11, I was running the various tests)

After that I can see a ton of exceptions and error messages saying that the
replicated store has stopped, and the like. After some time, it looks like
broker amq1 has re-stabilized itself and reports that it has started as a
slave.

I don't know exactly what is going on, but it appears that something is
wrong with the replicated LevelDB store, and that needs more investigation.


