[PR] ARTEMIS-4527 - Redistributor race when consumerCount reaches 0 in clu… [activemq-artemis]

via GitHub Wed, 06 Dec 2023 06:20:18 -0800


AntonRoskvist opened a new pull request, #4705:
URL: https://github.com/apache/activemq-artemis/pull/4705


   …ster
   
   This is a very rare bug but when triggered, messages in the queue with 0 
consumers will have the redistributors loop messages between some or all 
brokers in a cluster as fast as they can manage, until either some system 
resource or the clusterBridges producerFlowControl is reached. Will keep 
happening until consumers are added or cluster bridges are restarted.
   
   I don't have a test for this but instead added a reproducer that works with 
a considerable amount of tweaks. Comments in the reproducer explains how to run 
it. The reproducer is _not_ a valid or reasonable use case... it builds on some 
unrelated work I did that accidentally triggered this. I have seen this 
multiple times in a production environment over the course of several years 
though, I've just been unable to reproduce it outside of production before 
accidentally stumbling on it recently.
   
   Problem occurs when CONSUMER_CREATED notification arrive before the 
BINDING_ADDED notification.
   When that happens the consumerCount for RemoteBinding is incorrect 
(something like 1-2 lower than actual consumerCount value).
   
   Then when consumers disconnect, all are registered properly and 
RemoteBinding gets a negative consumerCount. The `isHighAcceptPriority` used by 
the redistributor checks for consumerCount == 0 but since count is now negative 
it returns as a valid destination.
   
   Fix adds synchronization on the postoffice when processing createConsumer so 
then the previously issued addBinding for sure is done before continuing.
   
   I also added double safety in the RemoteQueueBinding by not lowering 
consumerCount below 0 and also checking for consumerCount <= 0 instead of 
consumerCount == 0, though neither of these should really be necessary if the 
cluster notifications always arrive in the correct order.
   
   If anyone can figure out a consistent way to trigger this issue I'd be happy 
to add it. Regardless, if the changes look good otherwise I think the 
reproducer should be removed rather than merged, leaving it here for 
verification purposes.
   
   One final though is that perhaps all of the sort of create/add/remove 
operations in the 
`org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler` 
should be synchronized?
   Something building on the current pattern of:
   `onMessagePacket()`
   ```
   switch
      fast1:
         fast1Stuff();
      fast2:
        fast2Stuff();
     default:
        slow()
   
   slow:
   switch
      slow1:
         slow1Stuff();
      slow2:
         slow2Stuff();
      default:
        synchronizedStuff()
        
   synchronizedStuff:
   switch
     ...
     ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] ARTEMIS-4527 - Redistributor race when consumerCount reaches 0 in clu… [activemq-artemis]

Reply via email to