[ https://issues.apache.org/jira/browse/ARTEMIS-4527?focusedWorklogId=894308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-894308 ]
ASF GitHub Bot logged work on ARTEMIS-4527: ------------------------------------------- Author: ASF GitHub Bot Created on: 06/Dec/23 14:20 Start Date: 06/Dec/23 14:20 Worklog Time Spent: 10m Work Description: AntonRoskvist opened a new pull request, #4705: URL: https://github.com/apache/activemq-artemis/pull/4705 …ster This is a very rare bug but when triggered, messages in the queue with 0 consumers will have the redistributors loop messages between some or all brokers in a cluster as fast as they can manage, until either some system resource or the clusterBridges producerFlowControl is reached. Will keep happening until consumers are added or cluster bridges are restarted. I don't have a test for this but instead added a reproducer that works with a considerable amount of tweaks. Comments in the reproducer explains how to run it. The reproducer is _not_ a valid or reasonable use case... it builds on some unrelated work I did that accidentally triggered this. I have seen this multiple times in a production environment over the course of several years though, I've just been unable to reproduce it outside of production before accidentally stumbling on it recently. Problem occurs when CONSUMER_CREATED notification arrive before the BINDING_ADDED notification. When that happens the consumerCount for RemoteBinding is incorrect (something like 1-2 lower than actual consumerCount value). Then when consumers disconnect, all are registered properly and RemoteBinding gets a negative consumerCount. The `isHighAcceptPriority` used by the redistributor checks for consumerCount == 0 but since count is now negative it returns as a valid destination. Fix adds synchronization on the postoffice when processing createConsumer so then the previously issued addBinding for sure is done before continuing. I also added double safety in the RemoteQueueBinding by not lowering consumerCount below 0 and also checking for consumerCount <= 0 instead of consumerCount == 0, though neither of these should really be necessary if the cluster notifications always arrive in the correct order. If anyone can figure out a consistent way to trigger this issue I'd be happy to add it. Regardless, if the changes look good otherwise I think the reproducer should be removed rather than merged, leaving it here for verification purposes. One final though is that perhaps all of the sort of create/add/remove operations in the `org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler` should be synchronized? Something building on the current pattern of: `onMessagePacket()` ``` switch fast1: fast1Stuff(); fast2: fast2Stuff(); default: slow() slow: switch slow1: slow1Stuff(); slow2: slow2Stuff(); default: synchronizedStuff() synchronizedStuff: switch ... ... ``` Issue Time Tracking ------------------- Worklog Id: (was: 894308) Remaining Estimate: 0h Time Spent: 10m > Redistributor race when consumerCount reaches 0 in cluster > ---------------------------------------------------------- > > Key: ARTEMIS-4527 > URL: https://issues.apache.org/jira/browse/ARTEMIS-4527 > Project: ActiveMQ Artemis > Issue Type: Bug > Reporter: Anton Roskvist > Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > This is a very rare bug caused by cluster notifications arriving in the wrong > order in some very specific circumstances -- This message was sent by Atlassian Jira (v8.20.10#820010)