[ 
https://issues.apache.org/jira/browse/ARTEMIS-4527?focusedWorklogId=894308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-894308
 ]

ASF GitHub Bot logged work on ARTEMIS-4527:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Dec/23 14:20
            Start Date: 06/Dec/23 14:20
    Worklog Time Spent: 10m 
      Work Description: AntonRoskvist opened a new pull request, #4705:
URL: https://github.com/apache/activemq-artemis/pull/4705

   …ster
   
   This is a very rare bug, but when it is triggered the redistributors loop 
the messages of a queue with 0 consumers between some or all brokers in the 
cluster as fast as they can manage, until either some system resource is 
exhausted or the clusterBridges' producerFlowControl limit is reached. This 
keeps happening until consumers are added or the cluster bridges are restarted.
   
   I don't have a test for this; instead I added a reproducer that works with 
a considerable amount of tweaking. Comments in the reproducer explain how to 
run it. The reproducer is _not_ a valid or reasonable use case: it builds on 
some unrelated work I did that accidentally triggered this. I have seen this 
multiple times in a production environment over the course of several years, 
though; I was just unable to reproduce it outside of production until I 
accidentally stumbled on it recently.
   
   The problem occurs when a CONSUMER_CREATED notification arrives before the 
BINDING_ADDED notification.
   When that happens, the consumerCount for the RemoteBinding is incorrect 
(something like 1-2 lower than the actual consumerCount value).
   
   Then, when the consumers disconnect, every disconnect is registered 
properly and the RemoteBinding ends up with a negative consumerCount. The 
`isHighAcceptPriority` check used by the redistributor tests for 
consumerCount == 0, but since the count is now negative, the binding is 
treated as a valid destination.
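
   A minimal sketch of this failure mode, assuming a simplified counter. The 
class and method names below are hypothetical and do not reproduce the actual 
RemoteBinding code; they only illustrate how an exact `== 0` test lets a 
negative count through.

```java
// Hypothetical stand-in for a remote binding; NOT the real Artemis class.
class NaiveRemoteBinding {
    private int consumerCount;

    // Applied when a consumer-created notification reaches this binding.
    void addConsumer() {
        consumerCount++;
    }

    // Applied when a remote consumer disconnects; there is no lower bound.
    void removeConsumer() {
        consumerCount--;
    }

    int getConsumerCount() {
        return consumerCount;
    }

    // Mirrors the described check: only an exact zero means "no consumers".
    boolean looksLikeValidDestination() {
        return consumerCount != 0;
    }
}
```

   With three real consumers of which only one was counted (the other two 
notifications arrived before the binding existed), three disconnects leave the 
count at -2, and `looksLikeValidDestination()` still returns true.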
   
   The fix adds synchronization on the postoffice when processing 
createConsumer, so that the previously issued addBinding is guaranteed to have 
completed before processing continues.
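
   A rough sketch of the idea, assuming that addBinding runs while holding the 
post office monitor. All names here (`SketchPostOffice`, `SketchSessionHandler`, 
the `whileLocked` hook) are hypothetical and exist only for illustration; this 
is not the code in the PR. Taking the same monitor in the consumer path means 
an addBinding that has already started must finish first.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature post office; NOT the Artemis PostOffice API.
class SketchPostOffice {
    private final Map<String, Integer> consumerCounts = new HashMap<>();

    // Registers a binding while holding this object's monitor.
    // 'whileLocked' is a demo-only hook that runs inside the lock, so a
    // caller can observe or prolong the "add in progress" window.
    synchronized void addBinding(String queue, Runnable whileLocked) {
        whileLocked.run();
        consumerCounts.put(queue, 0);
    }

    // Counts a consumer only if the binding already exists.
    // Not synchronized itself: the caller decides whether to take the lock.
    boolean recordConsumer(String queue) {
        Integer count = consumerCounts.get(queue);
        if (count == null) {
            return false; // binding not known yet: the consumer goes uncounted
        }
        consumerCounts.put(queue, count + 1);
        return true;
    }

    synchronized int consumerCount(String queue) {
        return consumerCounts.getOrDefault(queue, 0);
    }
}

// Hypothetical packet handler showing the synchronized consumer path.
class SketchSessionHandler {
    private final SketchPostOffice postOffice;

    SketchSessionHandler(SketchPostOffice postOffice) {
        this.postOffice = postOffice;
    }

    boolean onCreateConsumer(String queue) {
        // Blocks until any in-flight addBinding has released the monitor.
        synchronized (postOffice) {
            return postOffice.recordConsumer(queue);
        }
    }
}
```

   Note that the lock only orders the two operations once addBinding has 
actually entered the monitor; it does not help if addBinding has not been 
issued yet.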
   
   I also added double safety in the RemoteQueueBinding by not lowering 
consumerCount below 0 and also checking for consumerCount <= 0 instead of 
consumerCount == 0, though neither of these should really be necessary if the 
cluster notifications always arrive in the correct order.
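
   The two safeguards can be sketched as follows, assuming a simplified 
counter. `GuardedRemoteBinding` and its methods are hypothetical names chosen 
for illustration, not the actual RemoteQueueBinding implementation.

```java
// Hypothetical stand-in showing both safeguards; NOT the real Artemis class.
class GuardedRemoteBinding {
    private int consumerCount;

    synchronized void addConsumer() {
        consumerCount++;
    }

    // Safeguard 1: never lower the count below zero.
    synchronized void removeConsumer() {
        if (consumerCount > 0) {
            consumerCount--;
        }
    }

    synchronized int getConsumerCount() {
        return consumerCount;
    }

    // Safeguard 2: treat any count <= 0 as "no consumers".
    synchronized boolean looksLikeValidDestination() {
        return consumerCount > 0;
    }
}
```

   With one counted consumer and three disconnects, the count stops at 0 and 
the binding is no longer reported as a valid destination.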
   
   If anyone can figure out a consistent way to trigger this issue, I'd be 
happy to add a test for it. Regardless, if the changes otherwise look good, I 
think the reproducer should be removed rather than merged; I'm leaving it here 
for verification purposes.
   
   One final thought: perhaps all of the create/add/remove operations in the 
`org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler` 
should be synchronized?
   Something building on the current pattern of:
   `onMessagePacket()`
   ```
   switch
      fast1:
         fast1Stuff();
      fast2:
         fast2Stuff();
      default:
         slow();

   slow:
   switch
      slow1:
         slow1Stuff();
      slow2:
         slow2Stuff();
      default:
         synchronizedStuff();

   synchronizedStuff:
   switch
      ...
      ...
   ```




Issue Time Tracking
-------------------

            Worklog Id:     (was: 894308)
    Remaining Estimate: 0h
            Time Spent: 10m

> Redistributor race when consumerCount reaches 0 in cluster
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-4527
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4527
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Anton Roskvist
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a very rare bug caused by cluster notifications arriving in the wrong 
> order in some very specific circumstances



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
