Take a look at whether the JVM is doing a full garbage collection at the
time the failover occurs.  Our team has observed clients fail over to an
alternate broker at times that corresponded to a full GC on the broker, and
it may be that the same thing is happening here (except that the failover
isn't happening gracefully).  If that's what's going on, you should be able
to work around the problem by tuning your JVM heap size and/or GC strategy,
though it still sounds like there's a failover-related bug that should be
tracked down and fixed as well.
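
One way to confirm that theory is to turn on GC logging on the master
broker's JVM and line the pause times up against the ZooKeeper session
timeout you can see in your log. Just a sketch (the exact flags depend on
your JDK version, and the log path is only an example) -- in bin/env, or
wherever you set ACTIVEMQ_OPTS:

  ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -Xloggc:/tmp/amq1-gc.log"

If a full GC pause of more than about two seconds shows up right before
the "Client session timed out ... in 2763ms" message, the master simply
lost its ZooKeeper session during the pause. In that case, raising
zkSessionTimeout on the replicatedLevelDB element (the default is 2s, if I
remember correctly) should make the election less trigger-happy while you
tune the heap -- something along these lines, with placeholder values:

  <persistenceAdapter>
    <replicatedLevelDB directory="activemq-data"
        replicas="3"
        zkAddress="zk1:2181,zk2:2181,zk3:2181"
        zkSessionTimeout="10s"/>
  </persistenceAdapter>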

On Mon, Oct 20, 2014 at 7:36 AM, uromahn <ulr...@ulrichromahn.net> wrote:

> Ok, looks like the issue is back again.
>
> The network issues have been fixed.
> It is *not* a slow network - pings between VMs are less than 1ms.
>
> I have not investigated the difference in throughput yet but wanted to focus
> on the reliability of the replicated message store.
>
> I made some configuration changes to the network connectors: I defined five
> connectors per node (amq1-3).
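>
> (Each of those is a plain <networkConnector> element in activemq.xml; the
> snippet below is only an illustration, with placeholder names and a
> placeholder destination filter rather than the literal configuration.)
>
>   <networkConnectors>
>     <networkConnector name="amq1-to-amq4-1"
>         uri="static:(tcp://amq4:61616)"
>         duplex="true">
>       <dynamicallyIncludedDestinations>
>         <queue physicalName="test.queue.1"/>
>       </dynamicallyIncludedDestinations>
>     </networkConnector>
>     <!-- ...plus four more connectors defined the same way... -->
>   </networkConnectors>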
>
> Here is what I observed:
> * When I launch one producer connecting to amq1 and one consumer connecting
> to amq4 and send 100,000 messages, everything works fine
> * When I launch five producers connecting to amq1 and five consumers
> connecting to amq4 and send 100,000 messages, still fine
> * When I launch 10 producers connecting to amq1 and 10 consumers connecting
> to amq4 and send 100,000 messages, I can see the following:
>   1. The number of pending messages in the queue on amq1 is slowly but
> steadily increasing; the consumers on amq4 are still reading messages.
>   2. After about 70,000 to 80,000 messages, amq1 suddenly stops working and
> amq2 gets promoted to master. amq4 is still reading messages.
>   3. From that time on, the log of amq4 fills up with the following
> exception: 2014-10-20 13:11:43,227 | ERROR | Exception:
> org.apache.activemq.transport.InactivityIOException: Cannot send, channel
> has already failed: null on duplex forward of: ActiveMQTextMessage ...
> <dump of message comes here>
>
> Here is an excerpt of the log from amq1 at the time it got "demoted" to
> slave:
> 2014-10-20 12:56:44,007 | INFO  | Slave has now caught up:
> 2607dbe5-e42a-44bf-8f90-6edf8caa8d87 |
> org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
> hawtdispatch-DEFAULT-1
> 2014-10-20 13:11:42,535 | INFO  | Client session timed out, have not heard
> from server in 2763ms for sessionid 0x2492d8210c30003, closing socket
> connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
> main-SendThread(uromahn-zk2-9775:2181)
> 2014-10-20 13:11:42,639 | INFO  | Demoted to slave |
> org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
> change dispatcher thread
>
> (NOTE: 12:56 was the time the broker cluster was started. Between that time
> and 13:11, I was running the various tests)
>
> After that I can see a ton of exceptions and error messages saying that the
> replicated store has stopped, and similar. After some time, it looks like
> broker amq1 has re-stabilized itself and reports that it has started as a
> slave.
>
> I don't know exactly what is going on, but it appears that something is
> wrong with the replicated LevelDB store, which needs more investigation.
