Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-30 Thread Tim Bain
I spent time last week trying to tune the parallel GC to prevent any
objects from reaching OldGen once the broker was up and running in a steady
state, to try to avoid expensive full GCs.  My goal was zero full GCs for a
broker with 3-6 months of uptime, to prevent clients and other brokers from
failing over from one broker to another.

I increased the size of NewGen relative to OldGen, increased the size of
Survivor relative to Eden, and tweaked a few other settings, but I was never
able to avoid a slow stream of objects making it into OldGen that were
already dead by the time a full GC happened (usually because I triggered one
manually).  I was able to cut the rate of object promotion roughly in half,
and full GCs would probably be less painful with OldGen at only 5-10% of the
total heap, so the changes should have made full GCs less frequent and less
painful, but I wasn't able to eliminate them entirely.

So I've given up on the parallel GC and am now tweaking G1 to make it behave
as we'd like; so far the results are far more promising than with the
parallel GC.  I therefore second Ulrich's recommendation to use G1 rather
than the parallel GC, even though G1's overhead is several times that of the
parallel collector, if you're more interested in avoiding occasional lengthy
pauses due to full GCs than in getting the highest possible throughput from
your broker.
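
For anyone who wants to try the same experiment, the knobs involved look
roughly like the following.  The values are illustrative only (not the exact
settings I used), the parenthetical notes are annotations rather than part of
the flags, and the flags would normally be passed to the broker JVM via
ACTIVEMQ_OPTS or your wrapper configuration:

  -XX:+UseParallelGC -XX:+UseParallelOldGC
  -XX:NewRatio=1               (make NewGen as large as OldGen)
  -XX:SurvivorRatio=4          (Eden:Survivor = 4:1, more room to age in Survivor)
  -XX:MaxTenuringThreshold=15  (let objects survive many minor GCs before promotion)

  -XX:+UseG1GC -XX:MaxGCPauseMillis=50   (the G1 alternative, targeting short pauses)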

On Tue, Oct 21, 2014 at 10:13 AM, Tim Bain tb...@alumni.duke.edu wrote:

 G1GC is great for reducing the duration of any single stop-the-world GC
 (and hence minimizing latency of any individual operation as well as
 avoiding timeouts), but the total time spent performing GCs (and hence the
 total amount of time the brokers are paused) is several times that of the
 parallel GC algorithm, based on some articles I read a couple weeks back.
 So although G1GC should work for a wide range (possibly all) of ActiveMQ
 memory usage patterns and may be the right option for you based on how your
 broker is used, you may get better overall throughput from sticking with
 ParallelGC but adjusting the ratio of YoungGen to OldGen to favor YoungGen
 (increasing the odds that a message gets GC'ed before it gets to OldGen)
 and the ratio of Eden to Survivor within YoungGen to favor Survivor (to
 increase the odds that a message can stick around in YoungGen long enough
 to die before it gets promoted to OldGen).  But you have to be confident
 that your usage patterns won't let OldGen fill up during your broker's
 uptime (whether that's hours or years); otherwise you'll end up doing a
 long full GC, and you'd probably have been better off going with
 G1GC.

 For our broker, we expire undelivered messages quickly (under a minute),
 so in theory expanding both YoungGen and Survivor might prevent anything
 from getting into OldGen and thus prevent long full GCs.  I'm actually
 going to be doing this tuning this week, so I'll report out what I find,
 though obviously YMMV since everyone's message usage patterns are different.

 On Tue, Oct 21, 2014 at 5:25 AM, uromahn ulr...@ulrichromahn.net wrote:

 Another update:

 I ran the broker with the pure Java LevelDB implementation and found that I
 am still seeing the warnings in the log file as reported before.

 However, to my surprise the broker seems to perform better and even
 slightly faster! I always thought the native LevelDB should be faster,
 but I guess the access via JNI may be less optimal than using an embedded
 Java (or Scala) engine.









Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-21 Thread uromahn
Based on your suggestion, I looked at the GC behavior of the JVM, and you
were 100% spot on. At the time amq1 got demoted to slave, forcing a failover
to amq2, a stop-the-world GC was going on.

Also, I was able to make the failover work correctly with the second cluster
in the network.
In my first cluster, consisting of amq1-3, the networkConnectors sections
were identical. Each defined connector has a different name, as suggested
(I have defined 5 connectors to improve throughput by using 5 concurrent
connections). However, after the failover the other side (amq4) complained
that a connection from amq1 already existed and hence rejected the
connection from amq2.
It looks like, in the case of such a failover, the connections from amq1 to
amq4 don't get cleaned up.

The work-around and solution was to give *every* connection from each
cluster node (amq1, amq2, amq3) a unique name.
So the networkConnectors section on amq1 looks like this:
<networkConnectors>
  <networkConnector name="link1a" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2a" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3a" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4a" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5a" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

and the same section on amq2 like this:
<networkConnectors>
  <networkConnector name="link1b" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2b" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link3b" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link4b" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link5b" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

(notice the different names, e.g. link1b vs link1a)

And similarly on amq3 (which I will spare you here).

I have run several tests now, and it looks like the failover is happening
correctly with no messages getting lost. I did, however, see a few cases where
messages got delivered twice. For example, I sent 100,000 messages from my
producer and the consumer actually received 100,043 messages.  Although not
ideal, since I will always have to do duplicate checking, it is better than
losing messages.

One additional note: when the failover happens, the other active cluster
node in the network (e.g. amq4) quite often dumps all the messages it
received but could not acknowledge to amq1 into its log. This is not really
good behavior, since nobody is going to scan through hundreds of lines in
the log file to identify those messages. It would be better to set up another
DLQ for that and dump the messages there rather than into the log file.

I will run some more tests after changing the GC to G1, hopefully avoiding
full GCs and the resulting demotion of the broker to slave, which forces a
failover.





Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-21 Thread uromahn
Quick update:

I have enabled G1GC for the JVM running the broker and have had no problems
since. The master broker stays master even under very heavy load.

So, my recommendation when using replicated LevelDB would be to use the G1
garbage collector, which significantly reduces the stop-the-world GCs that
cause a timeout on the connection to ZooKeeper and ultimately demote a broker
to slave for no real reason.
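
For reference, the ZooKeeper session timeout the store uses is configurable
via the zkSessionTimeout attribute of the replicated LevelDB configuration.
The snippet below is only a sketch - the hostnames, paths, and the 10s value
are placeholders, not my actual setup:

  <persistenceAdapter>
    <replicatedLevelDB
        directory="${activemq.data}/leveldb"
        replicas="3"
        bind="tcp://0.0.0.0:61619"
        zkAddress="zk1:2181,zk2:2181,zk3:2181"
        zkPath="/activemq/leveldb-stores"
        hostname="amq1"
        zkSessionTimeout="10s"/>
  </persistenceAdapter>

Keep in mind that ZooKeeper caps the session timeout it will grant based on
its own tickTime, so a larger value here may also require a change on the
ZooKeeper side.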

However, while running my tests, I noticed the following Warnings in the
log files of both slaves:
2014-10-21 09:44:37,694 | WARN  | Invalid log position: 1569963680 |
org.apache.activemq.leveldb.LevelDBClient | Thread-2

There are probably hundreds of those (didn't actually count them) with
obviously changing log positions.

I will retry my tests by switching to the Java implementation of LevelDB.






Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-21 Thread uromahn
Another update:

I ran the broker with the pure Java LevelDB implementation and found that I
am still seeing the warnings in the log file as reported before.

However, to my surprise the broker seems to perform better and even slightly
faster! I always thought the native LevelDB should be faster, but I guess the
access via JNI may be less optimal than using an embedded Java (or Scala)
engine.






Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-21 Thread Tim Bain
G1GC is great for reducing the duration of any single stop-the-world GC
(and hence minimizing latency of any individual operation as well as
avoiding timeouts), but the total time spent performing GCs (and hence the
total amount of time the brokers are paused) is several times that of the
parallel GC algorithm, based on some articles I read a couple weeks back.
So although G1GC should work for a wide range (possibly all) of ActiveMQ
memory usage patterns and may be the right option for you based on how your
broker is used, you may get better overall throughput from sticking with
ParallelGC but adjusting the ratio of YoungGen to OldGen to favor YoungGen
(increasing the odds that a message gets GC'ed before it gets to OldGen)
and the ratio of Eden to Survivor within YoungGen to favor Survivor (to
increase the odds that a message can stick around in YoungGen long enough
to die before it gets promoted to OldGen).  But you have to be confident
that your usage patterns won't let OldGen fill up during your broker's
uptime (whether that's hours or years); otherwise you'll end up doing a
long full GC, and you'd probably have been better off going with
G1GC.

For our broker, we expire undelivered messages quickly (under a minute), so
in theory expanding both YoungGen and Survivor might prevent anything from
getting into OldGen and thus prevent long full GCs.  I'm actually going to
be doing this tuning this week, so I'll report out what I find, though
obviously YMMV since everyone's message usage patterns are different.

On Tue, Oct 21, 2014 at 5:25 AM, uromahn ulr...@ulrichromahn.net wrote:

 Another update:

 I ran the broker with the pure Java LevelDB implementation and found that I
 am still seeing the warnings in the log file as reported before.

 However, to my surprise the broker seems to perform better and even slightly
 faster! I always thought the native LevelDB should be faster, but I guess
 the access via JNI may be less optimal than using an embedded Java (or
 Scala) engine.







Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-20 Thread uromahn
Ok, looks like the issue is back again.

The network issues have been fixed.
It is *not* a slow network - pings between VMs are less than 1ms.

I have not investigated the different throughput but wanted to focus on the
reliability of the replicated message store.

I made some configuration changes to the network connectors: I defined five
connectors per node (amq1-3).

Here is what I observed:
* When I launch one producer connecting to amq1 and one consumer connecting
to amq4 and send 100,000 messages, everything works fine
* When I launch five producers connecting to amq1 and five consumers
connecting to amq4 and send 100,000 messages, everything is still fine
* When I launch 10 producers connecting to amq1 and 10 consumers connecting to
amq4 and send 100,000 messages, I can see the following:
  1. number of pending messages in the queue on amq1 is slowly but steadily
increasing, consumer on amq4 is still reading messages
  2. after about 70,000 to 80,000 messages amq1 suddenly stops working and
amq2 gets promoted to master. amq4 is still reading messages
  3. From that time on, the log of amq4 is filling up with the following
exceptions: 2014-10-20 13:11:43,227 | ERROR | Exception:
org.apache.activemq.transport.InactivityIOException: Cannot send, channel
has already failed: null on duplex forward of: ActiveMQTextMessage ... dump
of message comes here

Here is an excerpt of the log from amq1 at the time it got demoted to
slave:
2014-10-20 12:56:44,007 | INFO  | Slave has now caught up:
2607dbe5-e42a-44bf-8f90-6edf8caa8d87 |
org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
hawtdispatch-DEFAULT-1
2014-10-20 13:11:42,535 | INFO  | Client session timed out, have not heard
from server in 2763ms for sessionid 0x2492d8210c30003, closing socket
connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
main-SendThread(uromahn-zk2-9775:2181)
2014-10-20 13:11:42,639 | INFO  | Demoted to slave |
org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
change dispatcher thread

(NOTE: 12:56 was the time the broker cluster was started. Between that time
and 13:11, I was running the various tests)

After that I can see a ton of exceptions and error messages saying that the
replicated store has stopped, and similar. After some time, it looks like
broker amq1 has re-stabilized itself and reports having been started as a
slave.

I don't know what exactly is going on, but it appears that something is
wrong with the replicated LevelDB which needs more investigation.





Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-20 Thread Tim Bain
Take a look at whether the JVM is doing a full garbage collect at the time
when the failover occurs.  Our team has observed clients to failover to an
alternate broker at a time that corresponded to a full GC, and it might be
that the same thing is happening here (but the failover isn't happening
gracefully).  If that's what's going on, you should be able to work around
the problem by tuning your JVM heap and/or your GC strategy, though it
still sounds like there's a bug related to the failover that should be
tracked down and fixed as well.
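
The easiest way to confirm the correlation is to enable GC logging and line
its timestamps up against the broker log.  Something along these lines should
work on the Java 7/8 HotSpot JVMs (the log path is just a placeholder):

  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime
  -Xloggc:/var/log/activemq/gc.log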

On Mon, Oct 20, 2014 at 7:36 AM, uromahn ulr...@ulrichromahn.net wrote:

 Ok, looks like the issue is back again.

 The network issues have been fixed.
 It is *not* a slow network - pings between VMs are less than 1ms.

 I have not investigated the different throughput but wanted to focus on the
 reliability of the replicated message store.

 I made some configuration changes to the network connectors: I defined five
 connectors per node (amq1-3).

 Here is what I observed:
 * When I launch one producer connecting to amq1 and one consumer connecting
 to amq4 and send 100,000 messages, everything works fine
 * When I launch five producers connecting to amq1 and five consumers
 connecting to amq4 and send 100,000 messages, everything is still fine
 * When I launch 10 producers connecting to amq1 and 10 consumers connecting
 to amq4 and send 100,000 messages, I can see the following:
   1. number of pending messages in the queue on amq1 is slowly but steadily
 increasing, consumer on amq4 is still reading messages
   2. after about 70,000 to 80,000 messages amq1 suddenly stops working and
 amq2 gets promoted to master. amq4 is still reading messages
   3. From that time on, the log of amq4 is filling up with the following
 exceptions: 2014-10-20 13:11:43,227 | ERROR | Exception:
 org.apache.activemq.transport.InactivityIOException: Cannot send, channel
 has already failed: null on duplex forward of: ActiveMQTextMessage ...
 dump
 of message comes here

 Here is an excerpt of the log from amq1 at the time it got demoted to
 slave:
 2014-10-20 12:56:44,007 | INFO  | Slave has now caught up:
 2607dbe5-e42a-44bf-8f90-6edf8caa8d87 |
 org.apache.activemq.leveldb.replicated.MasterLevelDBStore |
 hawtdispatch-DEFAULT-1
 2014-10-20 13:11:42,535 | INFO  | Client session timed out, have not heard
 from server in 2763ms for sessionid 0x2492d8210c30003, closing socket
 connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
 main-SendThread(uromahn-zk2-9775:2181)
 2014-10-20 13:11:42,639 | INFO  | Demoted to slave |
 org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
 change dispatcher thread

 (NOTE: 12:56 was the time the broker cluster was started. Between that time
 and 13:11, I was running the various tests)

 After that I can see a ton of exceptions and error messages saying that the
 replicated store has stopped, and similar. After some time, it looks like
 broker amq1 has re-stabilized itself and reports having been started as a
 slave.

 I don't know what exactly is going on, but it appears that something is
 wrong with the replicated LevelDB which needs more investigation.






Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-17 Thread uromahn
tbain98 wrote
 In your broker-to-broker networkConnectors, are you using maxReconnects=0
 as an argument to the failover URI?  It wouldn't explain why amq4 got
 demoted, but it could explain why messages aren't transferring to amq5
 instead.

Here is the definition of my networkConnectors inside activemq.xml:

<networkConnectors>
  <networkConnector name="link1"
      duplex="true"
      conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
  <networkConnector name="link2"
      duplex="true"
      conduitSubscriptions="false"
      decreaseNetworkConsumerPriority="false"
      uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"/>
</networkConnectors>

As you can see, I have used the masterslave transport instead of
failover and have not specified any additional configuration parameters.
I will try, however, to change this to failover for another test.

tbain98 wrote
 You say you've got duplex connections between the clusters; which cluster
 is the one that establishes them via a networkConnector?  And do you see
 the same behavior if you put producers on cluster2 and consumers on
 cluster1?

I have not tried that, but will do this first, even before I change
masterslave to failover.

tbain98 wrote
 Also, looking at your logs it's not clear what happens between 13:00:48
 (when amq5 becomes the master) and 13:32:20 (30 minutes later, when the
 LevelDB exception occurs).  Are messages transferring successfully to
 amq5,
 or is it sitting idle?

Oops, that was a copy/paste error. The message at 13:32:20 was actually
caused by me shutting down all my brokers. Sorry for the confusion.
And to answer your question: there were absolutely no messages transferred
from amq1 to either amq4, amq5, or amq6. This was visible by looking at the
Network page in the web console of amq1, which showed the number of
messages enqueued and dequeued.

I will post back to this thread once I have run the other suggested tests.

(removed my original post for brevity)







Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-17 Thread uromahn
Quick update...

I re-ran my tests as suggested.

First my producer connected to amq4 and the consumer to amq1. That setup
worked quite well without any error or timeout.
Then I re-configured it again with messages being sent to amq1 and consumed
from amq4. To my surprise it worked this time (I re-ran the test three times
yesterday and all three failed the same way!).
However, I noticed that when transmitting messages from amq1 to amq4, it
appears that the bridge is slower than the other way around, since I saw on
average 10,000 pending messages across all queues - the other way around, there
were on average fewer than 1,500 messages pending.

On another note: I mentioned that I set up this environment in our private
cloud infrastructure. This morning I saw a note from our infrastructure guys
that we are having some network issues in the data center hosting this
environment. It is certainly possible that my issues yesterday may have been
a side effect of the network problems.

So, I will keep testing but for now, my suspicion is that it is likely to be
a network problem and not an issue within ActiveMQ.

I will follow-up in case the issue shows up again.

Until then, sorry for potentially raising a false alarm.





Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-17 Thread Tim Bain
masterslave: is an alias for failover: with maxReconnects=0.  (There might
be another URI option included in the alias, I don't remember; I think the
details are in the JIRA where Gary added the failover transport, if you're
curious.)  So there's no need to try using failover explicitly, since the
configuration you're using already used the URI option I was concerned
about.
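
So, roughly speaking, the two networkConnector URIs below should behave the
same; the second one just spells out the option the alias implies (plus
whatever other options the alias may add that I'm not remembering):

  uri="masterslave:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)"
  uri="failover:(tcp://uromahn-amq4:61616,tcp://uromahn-amq5:61616,tcp://uromahn-amq6:61616)?maxReconnects=0"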

That's strange that you're seeing different throughput (based on different
numbers of pending messages) based on which direction messages flow between
the clusters.  It might be due to the network issues you referenced; if
not, then hopefully you can figure out which link is the slow one by
finding the last point where messages are piling up.  Is there a
non-trivial amount of latency (more than 15ms, let's say) on any of the
links between brokers or the links between clients and brokers?  I've had
to do quite a bit of tuning to get ActiveMQ to run efficiently across a
high-latency WAN, so if you have a bad network link in your setup you may
need to make some adjustments to improve throughput.
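
Adjustments of that sort typically mean things like larger socket buffers on
the broker-to-broker transport and a bigger prefetch on the network connector
so the bridge can keep the pipe full despite the round-trip time.  Purely as
a sketch - the option values are placeholders, and these particular options
are examples of the kind of tuning involved, not necessarily what I used:

  <networkConnector name="link1" duplex="true"
      conduitSubscriptions="false" decreaseNetworkConsumerPriority="false"
      prefetchSize="2000"
      uri="masterslave:(tcp://uromahn-amq4:61616?socketBufferSize=131072,tcp://uromahn-amq5:61616?socketBufferSize=131072,tcp://uromahn-amq6:61616?socketBufferSize=131072)"/>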

Also, just to confirm: were you comparing pending queue sizes based on
which role (producer-side or consumer-side) the cluster was being used for in
your test?  (So comparing amq1-3 in your first setup with amq4-6 in your
second setup and vice versa.)  Make sure your comparisons were apples to
apples between the tests, otherwise the conclusion of lower throughput
might not be valid.

On Fri, Oct 17, 2014 at 4:20 AM, uromahn ulr...@ulrichromahn.net wrote:

 Quick update...

 I re-ran my tests as suggested.

 First my producer connected to amq4 and the consumer to amq1. That setup
 worked quite well without any error or timeout.
 Then I re-configured it again with messages being sent to amq1 and consumed
 from amq4. To my surprise it worked this time (I re-ran the test three
 times
 yesterday and all three failed the same way!).
 However, I noticed that when transmitting messages from amq1 to amq4, it
 appears that the bridge is slower than the other way around since I saw on
 average 10,000 pending messages across all queues - the other way around, there
 were on average fewer than 1,500 messages pending.

 On another note: I mentioned that I set up this environment in our private
 cloud infrastructure. This morning I saw a note from our infrastructure
 guys
 that we are having some network issues in that data center hosting this
 environment. It is certainly possible that my issues yesterday may have
 been
 a side-effect of the network problems.

 So, I will keep testing but for now, my suspicion is that it is likely to
 be
 a network problem and not an issue within ActiveMQ.

 I will follow-up in case the issue shows up again.

 Until then, sorry for potentially raising a false alarm.






Re: Potential Bug in Master-Slave with Replicated LevelDB Store

2014-10-16 Thread Tim Bain
In your broker-to-broker networkConnectors, are you using maxReconnects=0
as an argument to the failover URI?  It wouldn't explain why amq4 got
demoted, but it could explain why messages aren't transferring to amq5
instead.

You say you've got duplex connections between the clusters; which cluster
is the one that establishes them via a networkConnector?  And do you see
the same behavior if you put producers on cluster2 and consumers on
cluster1?

Also, looking at your logs it's not clear what happens between 13:00:48
(when amq5 becomes the master) and 13:32:20 (30 minutes later, when the
LevelDB exception occurs).  Are messages transferring successfully to amq5,
or is it sitting idle?

On Thu, Oct 16, 2014 at 8:42 AM, uromahn ulr...@ulrichromahn.net wrote:

 I believe I may have found a bug here. However, this could also be a
 mis-configuration on my side.

 Before I go into a detailed description of my observation, here my setup:

 I have setup the following system in a virtual environment:
 I have 3 zookeeper nodes.
 I have 6 ActiveMQ brokers running 5.10.0
 All nodes (ZK, AMQ) are running on CentOS 6.5 64bit with the latest
 OpenJDK.
 Three brokers form an active/passive cluster using the replicated LevelDB store.
 I have installed native LevelDB 1.7.0, accessing it via the JNI driver.
 The two clusters form a network of brokers.
 The networkConnectors are defined in the activemq.xml files of only one
 cluster, as duplex connections.

 Here is my test case and the situation:
 Let's name the six brokers amq1 - amq6. So, the first active/passive cluster
 is amq1, amq2, and amq3 with amq1 active and the other two passive. The
 second cluster consists of amq4, amq5, and amq6 with amq4 as the active.
 I have one producer connecting to amq1, publishing messages to a virtual
 topic named VirtualTopic.Test, and a consumer connecting to amq4, reading
 those messages from the corresponding queue Consumer.A.VirtualTopic.Test.
 In my test, I am sending 100,000 text messages with a body consisting of
 1024 random characters to the VirtualTopic at maximum speed.
 However, after about 25,000 to 27,000 messages, the consumer on amq4 times
 out after about 10 seconds of not receiving any more messages, although the
 producer has already sent all 100,000 messages to amq1.
 When looking at the log file of amq4, I am seeing the following messages:
 
 2014-10-16 12:53:12,556 | INFO  | Started responder end of duplex bridge
 link2@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
 org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
 tcp:///xx.xx.xx.xx:58959@61616
 2014-10-16 12:53:12,559 | INFO  | Started responder end of duplex bridge
 link1@ID:uromahn-amq1-9110-48269-1413463991773-0:1 |
 org.apache.activemq.broker.TransportConnection | ActiveMQ Transport:
 tcp:///xx.xx.xx.xx:58958@61616
 2014-10-16 12:53:12,591 | INFO  | Network connection between
 vm://brokergrp2#2 and tcp:///xx.xx.xx.xx:58958@61616 (brokergrp1) has been
 established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
 triggerStartAsyncNetworkBridgeCreation:
 remoteBroker=tcp:///xx.xx.xx.xx:58958@61616, localBroker=
 vm://brokergrp2#2
 2014-10-16 12:53:12,591 | INFO  | Network connection between
 vm://brokergrp2#0 and tcp:///10.64.253.198:58959@61616 (brokergrp1) has
 been
 established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
 triggerStartAsyncNetworkBridgeCreation:
 remoteBroker=tcp:///xx.xx.xx.xx:58959@61616, localBroker=
 vm://brokergrp2#0
 2014-10-16 13:00:10,470 | INFO  | Client session timed out, have not heard
 from server in 4071ms for sessionid 0x14918fd40f7, closing socket
 connection and attempting reconnect | org.apache.zookeeper.ClientCnxn |
 main-SendThread(uromahn-zk1-9208:2181)
 2014-10-16 13:00:10,575 | INFO  | Demoted to slave |
 org.apache.activemq.leveldb.replicated.MasterElector | ZooKeeper state
 change dispatcher thread
 2014-10-16 13:00:10,582 | INFO  | Apache ActiveMQ 5.10.0 (brokergrp2,
 ID:uromahn-amq4-9175-46383-1413463934439-0:1) is shutting down |
 org.apache.activemq.broker.BrokerService | ActiveMQ
 BrokerService[brokergrp2] Task-8
 2014-10-16 13:00:10,594 | WARN  | Transport Connection to:
 tcp://zz.zz.zz.zz:34737 failed: java.io.IOException: Unexpected error
 occured: org.apache.activemq.broker.BrokerStoppedException: Broker
 BrokerService[brokergrp2] is being stopped |
 org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ
 Transport: tcp:///zz.zz.zz.zz:34737@61616
 

 After that, the log file contains a large number of messages dumped for
 which the broker could not send the acknowledgement back to amq1.

 Looking at the log file of the newly promoted master amq5 I see the
 following warning:
 
 2014-10-16 13:00:18,422 | INFO  | Network connection between
 vm://brokergrp2#0 and tcp:///yy.yy.yy.yy:58256@61616 (brokergrp1) has been
 established. | org.apache.activemq.network.DemandForwardingBridgeSupport |
 triggerStartAsyncNetworkBridgeCreation: