AntonRoskvist opened a new pull request, #4705:
URL: https://github.com/apache/activemq-artemis/pull/4705
…ster
This is a very rare bug but when triggered, messages in the queue with 0
consumers will have the redistributors loop messages between some or all
brokers in a cluster as fast as they can manage, until either some system
resource or the clusterBridges producerFlowControl is reached. Will keep
happening until consumers are added or cluster bridges are restarted.
I don't have a test for this but instead added a reproducer that works with
a considerable amount of tweaks. Comments in the reproducer explains how to run
it. The reproducer is _not_ a valid or reasonable use case... it builds on some
unrelated work I did that accidentally triggered this. I have seen this
multiple times in a production environment over the course of several years
though, I've just been unable to reproduce it outside of production before
accidentally stumbling on it recently.
Problem occurs when CONSUMER_CREATED notification arrive before the
BINDING_ADDED notification.
When that happens the consumerCount for RemoteBinding is incorrect
(something like 1-2 lower than actual consumerCount value).
Then when consumers disconnect, all are registered properly and
RemoteBinding gets a negative consumerCount. The `isHighAcceptPriority` used by
the redistributor checks for consumerCount == 0 but since count is now negative
it returns as a valid destination.
Fix adds synchronization on the postoffice when processing createConsumer so
then the previously issued addBinding for sure is done before continuing.
I also added double safety in the RemoteQueueBinding by not lowering
consumerCount below 0 and also checking for consumerCount <= 0 instead of
consumerCount == 0, though neither of these should really be necessary if the
cluster notifications always arrive in the correct order.
If anyone can figure out a consistent way to trigger this issue I'd be happy
to add it. Regardless, if the changes look good otherwise I think the
reproducer should be removed rather than merged, leaving it here for
verification purposes.
One final though is that perhaps all of the sort of create/add/remove
operations in the
`org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler`
should be synchronized?
Something building on the current pattern of:
`onMessagePacket()`
```
switch
fast1:
fast1Stuff();
fast2:
fast2Stuff();
default:
slow()
slow:
switch
slow1:
slow1Stuff();
slow2:
slow2Stuff();
default:
synchronizedStuff()
synchronizedStuff:
switch
...
...
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]