Hello ActiveMQ community:

TL;DR: I now think this is really a misconfiguration on our part, but it took 
quite a lot of digging before we nailed the issue, so I am reporting it to save 
others time in the future.

We are running a "store and forward network of brokers" where each broker is 
connected to all other brokers (full mesh). Our applications connect only to 
their local broker. Under load we would occasionally see a broker just 
"disappear" from the rest of the cluster, and all of the work would end up on 
the remaining nodes. We had trouble isolating the fault because our overall 
system wasn't handling this gracefully and generated additional traffic, which 
made cause and effect difficult to trace.

I set out to reproduce the failure we were having in as small a case as I 
could. The result is at: https://github.com/samhendley/activemq-bug-reports 
where I document the experiment more fully. I wasn't able to get a 100% 
reproduction; the best I could do was get about 50% of the runs on my machine 
to fail. This makes me believe it is probably a race condition, but I wasn't 
able to find any obvious smoking guns.

In short, I found that if the overall broker MemoryUsage limit is exceeded 
(which can happen because producer flow control is off), the network connectors 
between the brokers would sometimes become stuck. If I enabled producer flow 
control or increased the configured max memory, the issue was no longer 
reproducible.

It looks like we can reconfigure our production systems to work around this 
problem, but should I file a bug for this? A silent failure like this is really 
not fun to diagnose on a large-scale system.

Sam

From the GitHub page:

Bug description:

If the configured MemoryStore limit is large enough for usage to stay below 
100% while the requestor application is dumping messages into the broker 
network, the test passes. If, however, memory usage on the brokers exceeds 
100% (in this case peaking around 600% of 100 MB), the network connectors 
sometimes become "stuck". Stuck in this case means there are messages enqueued 
on one or both of the "server" brokers but the messages are not being dequeued 
or forwarded by the network connector back to the "client" broker.

This issue doesn't happen on every run with a small memory size, but in my 
tests it failed about 50% of the time. You may have to run it a few times 
before getting it to fail. On one failure JMX showed 
that 417k responses had been generated on server1 but only 363k had been 
dequeued for transmission to the client broker. In that test run the other 
server had correctly handled the other 583k requests.
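
To make the "stuck" condition concrete, here is a minimal sketch in Python. It 
is not part of the test project, and the names QueueSample and looks_stuck are 
hypothetical. It assumes you have copied a queue's enqueue and dequeue counters 
(for example from JMX) at two points in time, and it treats the connector as 
possibly stuck when messages remain queued but the dequeue counter has stopped 
moving. The counts are the rounded figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One reading of a queue's counters (hypothetical structure, e.g. copied from JMX)."""
    enqueued: int  # total messages enqueued so far
    dequeued: int  # total messages dequeued so far

    @property
    def backlog(self) -> int:
        """Messages enqueued but not yet dequeued."""
        return self.enqueued - self.dequeued


def looks_stuck(earlier: QueueSample, later: QueueSample) -> bool:
    """True when a backlog remains but nothing was dequeued between the readings."""
    return later.backlog > 0 and later.dequeued == earlier.dequeued


# Rounded figures from the failed run: 417k responses generated, 363k dequeued.
failed = QueueSample(enqueued=417_000, dequeued=363_000)
print(failed.backlog)               # 54000 responses never forwarded
print(looks_stuck(failed, failed))  # True: the counters stopped moving
```

A check like this only flags a candidate; a slow but healthy connector would 
show the dequeue counter still rising between readings.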

When it does fail there is nothing in the log that indicates anything is amiss. 
I would have expected to see some sort of log message to indicate that the 
network connector has been throttled (if indeed that is what is happening). 
The same test run against a single broker always passes, which leads me to 
believe it really is a problem with the network connectors.

