[ 
https://issues.apache.org/jira/browse/ARTEMIS-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Pascal Briquet updated ARTEMIS-5086:
-----------------------------------------
    Attachment: image-2024-11-21-11-08-16-869.png

> Cluster connection randomly fails and stop message redistribution
> -----------------------------------------------------------------
>
>                 Key: ARTEMIS-5086
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, Clustering
>    Affects Versions: 2.30.0, 2.35.0, 2.36.0
>            Reporter: Jean-Pascal Briquet
>            Priority: Major
>         Attachments: address-settings.xml, cluster-connections-stop.log, 
> image-2024-10-08-14-26-51-937.png, image-2024-11-21-11-04-58-242.png, 
> image-2024-11-21-11-08-16-869.png, message-events-during-incident-1.log, 
> pr21-broker.xml
>
>
> h4. Context
> In a cluster of 3 primary/backup pairs, it can happen that the cluster 
> connection randomly fails and does not automatically recover.
> The frequency of the problem is random and it can happens once every few 
> weeks.
> When cluster-connectivity is degraded, it stops the message flow between 
> brokers and interrupts the message redistribution.
> Not all cluster nodes may be affected, some may still maintain 
> cluster-connectivity, while others are partially affected, and some can lose 
> all connectivity.
> There are no errors visible in logs when the issue occurs.
> h4. Workaround
> An operator has to stop and start the cluster connection via JMX management.
> This means that message redistribution can be interrupted for a potentially 
> long time until the connection is manually restarted.
> h4. How to recognize the problem
> The cluster-connections JMX panel indicates that:
>  - cluster-connectivity is started
>  - topology is correct and contains all nodes (3 members, 6 nodes)
>  - the nodes field is either empty or contains only one entry (instead of 
> two when everything works). In my opinion, this is the main indicator: when 
> everything works, nodes should equal "members in topology - 1".
> h4. Consequences
>  - Messages are stuck in {{$.artemis.internal.sf.artemis-cluster.*}} queues 
> until the cluster connection is restarted.
>  - Messages are stuck in {{notif.*}} queues until the cluster connection is 
> restarted.
>  - Consumers are starved because message redistribution is broken.
>  
> h4. Potential trigger
> I have observed this issue several times over the past months, but 
> unfortunately, I don't have a reproduction case.
> I would have preferred something more predictable, but it seems to be a 
> random problem.
> When the issue occurred this week, I noticed a strange coincidence: we 
> deployed a configuration change (addition of 10 new addresses) at the same 
> time on two different clusters.
> Configuration refresh is enabled, and during the upgrade process we touch 
> broker.xml to trigger the config reload (so 6 * 2 nodes had their 
> configuration reloaded).
> On both clusters, one node had correct cluster connectivity (nodes=2), one 
> node only one connection (nodes=1), and one node no connections at all 
> (nodes=0).
> Maybe I'm wrong, but the fact that it happened on two clusters after the 
> same operation makes me think the two may be related.
> Please note that most of the time the config reload works very well and does 
> not impact cluster connections.
>  
> h4. Investigation
> Since I don't have a clear reproduction scenario, I checked the code to 
> understand when the {{ClusterConnectionImpl.getNodes()}} could return an 
> empty list.
> It seems that nodes are not listed when:
>  - the record list is empty, or
>  - the record list has elements but the session is null, or
>  - the record list has elements but the forward connection is null
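To make those three conditions concrete, here is a minimal sketch with hypothetical stand-in types ({{RecordModel}} and {{NodesView}} are my own simplified names, not the actual Artemis classes) of filtering logic that would leave {{getNodes()}} empty:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the behaviour described above: entries
// are skipped when their session or forwarding connection is null, so a
// non-empty record list can still yield an empty nodes map.
class RecordModel {
    final String nodeId;
    final Object session;           // stands in for the record's session
    final Object forwardConnection; // stands in for the bridge's forwarding connection

    RecordModel(String nodeId, Object session, Object forwardConnection) {
        this.nodeId = nodeId;
        this.session = session;
        this.forwardConnection = forwardConnection;
    }
}

class NodesView {
    // Mirrors the three empty-list conditions: no records, null session,
    // or null forward connection.
    static Map<String, String> getNodes(Iterable<RecordModel> records) {
        Map<String, String> nodes = new HashMap<>();
        for (RecordModel r : records) {
            if (r.session == null) continue;            // session not established
            if (r.forwardConnection == null) continue;  // bridge lost its forwarding connection
            nodes.put(r.nodeId, "connected");
        }
        return nodes;
    }
}
```

Under this model, a record list with two bridges whose forward connections are null returns an empty map, which matches what the traces below suggest.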
> During the last incident, we have enabled TRACE level on:
>  * {{org.apache.activemq.artemis.core.server.cluster}}
>  * {{org.apache.activemq.artemis.core.client}}
> When we performed the stop operation on cluster-connections the traces 
> indicated that:
>  - the record list had two entries (2 bridges, which is good)
>  - {{session}} had a value (not sure about {{sessionConsumer}})
>  - the forward connection is the only remaining element that could be null
> These stop traces are provided as attachments, if you want to review them.
> Based on that, I believe the list was empty because the forward connection 
> was null.
> The {{getNodes}} method contains a specific null check for the forward 
> connection, so it seems this null state can occur occasionally. When could 
> it happen?
> I would expect the bridge auto-reconnection logic to restore the connection, 
> but it does not seem to detect this state, as the connection never recovers.
> Sorry this is a bit vague, but if you have tips for further investigation, I 
> would be happy to try them and provide more information.
>  
> *Grafana visualisation of the depth of the {{notif.*}} queues when the 
> incident occurred:*
>  * primary-1 had 0 cluster-connection nodes
>  * primary-2 had 2 cluster-connection nodes
>  * primary-3 had 1 cluster-connection node
> !image-2024-10-08-14-26-51-937.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact

