In your broker-to-broker networkConnectors, are you using maxReconnects=0
as an argument to the failover URI?  It wouldn't explain why amq4 got
demoted, but it could explain why messages aren't transferring to amq5
instead.

You say you've got duplex connections between the clusters; which cluster
is the one that establishes them via a networkConnector?  And do you see
the same behavior if you put producers on cluster2 and consumers on
cluster1?

Also, looking at your logs it's not clear what happens between 13:00:48
(when amq5 becomes the master) and 13:32:20 (30 minutes later, when the
LevelDB exception occurs).  Are messages transferring successfully to amq5,
or is it sitting idle?

On Thu, Oct 16, 2014 at 8:42 AM, uromahn <ulr...@ulrichromahn.net> wrote:

> I believe I may have found a bug here. However, this could also be a
> mis-configuration on my side.
>
> Before I go into a detailed description of my observation, here my setup:
>
> I have setup the following system in a virtual environment:
> I have 3 zookeeper nodes.
> I have 6 ActiveMQ broker using 5.10.0
> All nodes (ZK, AMQ) are running on CentOS 6.5 64bit with the latest
> OpenJDK.
> Three broker form an active/passive cluster using replicated LevelDB store.
> I have installed native LevelDB 1,7.0 accessing it via the JNDI driver.
> The two cluster are forming a network of broker.
> The "networkConnectors" are defined in the activemq.xml files in only one
> cluster as duplex connections.
>
> Here is my test case and the situation:
> Let's name the six broker amq1 - amq6. So, the first active/passive cluster
> is amq1, amq2, and amq3 with amq1 active and the other two passive. The
> second cluster consists of amq4, amq5, and amq6 with amq4 as the active.
> I have one producer connecting to amq1 publishing messages to a
> VirtualTopic
> "VirtualTopic.Test" and a consumer connecting to amq4 reading those
> messages
> from a corresponding queue "Consumer.A.VirtualTopic.Test".
> In my test, I am sending 100,000 text messages with a body consisting of
> 1024 random characters to the VirtualTopic at maximum speed.
> However, after about 25,000 to 27,000 messages, the consumer on amq4 times
> out after about 10 seconds not receiving any more messages although the
> producer has already send all 100,000 messages to amq1.
> When looking at the log file of amq4, I am seeing the following messages:
> ....
> 2014-10-16 12:53:12,556 | INFO  | Started responder end of duplex bridge
> link2@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
> org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
> tcp:///xx.xx.xx.xx:58959@61616
> 2014-10-16 12:53:12,559 | INFO  | Started responder end of duplex bridge
> link1@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
> org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
> tcp:///xx.xx.xx.xx:58958@61616
> 2014-10-16 12:53:12,591 | INFO  | Network connection between
> vm://brokergrp2#2 and tcp:///xx.xx.xx.xx:58958@61616 (brokergrp1) has been
> established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
> triggerStartAsyncNetworkBridgeCreation:
> remoteBroker=tcp:///xx.xx.xx.xx:58958@61616, localBroker=
> vm://brokergrp2#2
> 2014-10-16 12:53:12,591 | INFO  | Network connection between
> vm://brokergrp2#0 and tcp:///10.64.253.198:58959@61616 (brokergrp1) has
> been
> established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
> triggerStartAsyncNetworkBridgeCreation:
> remoteBroker=tcp:///xx.xx.xx.xx:58959@61616, localBroker=
> vm://brokergrp2#0
> 2014-10-16 13:00:10,470 | INFO  | Client session timed out, have not heard
> from server in 4071ms for sessionid 0x14918fd40f00007, closing socket
> connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
> main-SendThread(uromahn-zk1-9208:2181)
> 2014-10-16 13:00:10,575 | INFO  | Demoted to slave |
> org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
> change dispatcher thread
> 2014-10-16 13:00:10,582 | INFO  | Apache ActiveMQ 5.10.0 (brokergrp2,
> ID:uromahn-amq4-9175-46383-1413463934439-0:1) is shutting down |
> org.apache.activemq.broker.BrokerService | ActiveMQ
> BrokerService[brokergrp2] Task-8
> 2014-10-16 13:00:10,594 | WARN  | Transport Connection to:
> tcp://zz.zz.zz.zz:34737 failed: java.io.IOException: Unexpected error
> occured: org.apache.activemq.broker.BrokerStoppedException: Broker
> BrokerService[brokergrp2] is being stopped |
> org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ
> Transport: tcp:///zz.zz.zz.zz:34737@61616
> ....
>
> After that, the log file contains a large number of messages dumped for
> which the broker could not send the acknowledgement back to amq1.
>
> Looking at the log file of the newly promoted master amq5 I see the
> following warning:
> ....
> 2014-10-16 13:00:18,422 | INFO  | Network connection between
> vm://brokergrp2#0 and tcp:///yy.yy.yy.yy:58256@61616 (brokergrp1) has been
> established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
> triggerStartAsyncNetworkBridgeCreation:
> remoteBroker=tcp:///yy.yy.yy.yy:58256@61616, localBroker=
> vm://brokergrp2#0
> 2014-10-16 13:32:20,285 | WARN  | Unexpected session error:
> java.io.IOException: Connection reset by peer |
> org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
> hawtdispatch-DEFAULT-2
> 2014-10-16 13:32:20,286 | INFO  | Slave has disconnected:
> e21da7b3-65a4-4756-a20b-bd77f8eb7607 |
> org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
> hawtdispatch-DEFAULT-2
> ....
>
> After that, there are no messages being transmitted from amq1 to the second
> cluster in the network (neither amq4 nor amq5). This can be seen when
> looking at the corresponding tab of the web console of amq1. Also, when I
> let all broker running, I am getting a message on amq4 that the broker is
> waiting for the replicated store to synchronize one record. (or something
> to
> this extend - unfortunately I deleted the log already. :( ).
>
> So, here is what appears to happen:
> somehow Zookeeper things that amq4 is "dead" and then demotes it to slave
> in
> the middle of the message transfer. This could be because amq4 is too busy
> to respond to heatbeat requests. However, something goes wrong in the
> LevelDB replication and the new master amq5 can't pick up. This seem to
> lead
> to a "dead-lock" situation causing the client to time-out.
> NOTE: all clients are using the failover transport to connect to the
> cluster. The connector between the two cluster is also defined using the
> failover transport.
> Another indication of something "bad" is the fact that the client reports
> an
> exception that the connection to amq4 was reset and it is trying to
> reconnect. However, that connection does not seem to happen within 10
> seconds because my message listener has an internal timeout of 10 seconds.
> I have double and triple-checked the configuration and everything looks ok.
> Also, this doesn't happen in case both producer and consumer connect to the
> same cluster, e.g. amq1 or amq4. So, the individual cluster seem to work.
> The issue only seem to appear when transmitting messages between clusters
> in
> a network of broker.
>
> Any idea if this is a bug or just a SUE (stupid user error). :)
>
> -Uli
>
>
>
> --
> View this message in context:
> http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-tp4686450.html
> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
>

Reply via email to