[
https://issues.apache.org/jira/browse/ARTEMIS-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887343#comment-17887343
]
Jean-Pascal Briquet edited comment on ARTEMIS-5086 at 10/7/24 1:26 PM:
-----------------------------------------------------------------------
[~jbertram], to answer your question: in our context, the Artemis broker is
central to the communication of hundreds of banking/payment services that rely
on it.
Our goal is to achieve 99.95% availability and ensure no messages are ever
lost, which is why we chose this topology in the first place.
A single primary/backup pair would be too risky, as we need to perform
maintenance on parts of the cluster and do rolling deployments without the risk
of breaking everything.
We have noticed that the duration of the failover from the primary to the
backup starts to take a significant amount of time as the number of configured
queues grows.
With a single pair, each failover event would result in service unavailability
for the applications, which is something we want to avoid.
*Regarding deployment:*
Two clusters are deployed, one in each data center (DC): one cluster in the
first DC and a second cluster (for DR) in the other.
Primary and backup are deployed in the same DC but in different zones, and
replication via mirroring occurs cross-DC, providing disaster recovery (no
consumers read from the replicated queues).
*Regarding performance, we have the following application profiles:*
- applications sensitive to message throughput (a few million messages per day)
- applications sensitive to message latency (a few tens of thousands of
messages per day)
- applications using pub/sub (hundreds of messages per day).
Currently, the applications are configured to switch immediately to one of the
other pairs if they fail to communicate.
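As an illustration, the immediate-switch behaviour can be sketched like this (the pair addresses and the selection policy are hypothetical, not our actual client code):

```python
# Sketch of a client that switches to another primary/backup pair as soon as
# the current one becomes unreachable. Pair addresses are placeholders.
PAIRS = [
    ("artemis-a1:61616", "artemis-a2:61616"),  # pair A (primary, backup)
    ("artemis-b1:61616", "artemis-b2:61616"),  # pair B
    ("artemis-c1:61616", "artemis-c2:61616"),  # pair C
]

def next_pair(current_index: int, failed: set) -> int:
    """Return the index of the next pair to try, skipping known-failed pairs."""
    n = len(PAIRS)
    for step in range(1, n + 1):
        candidate = (current_index + step) % n
        if candidate not in failed:
            return candidate
    raise RuntimeError("no reachable pair left")
```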
So yes, performance is important for us, but after reviewing resource usage, I
believe that our Artemis instances are not yet under heavy load.
Stability, availability and operability are more important at this stage.
*We are aware that redistribution is a limiting factor and adds load on the
Artemis nodes.*
To address this, we are implementing custom client connectivity logic to
balance consumers across all pairs.
Once enabled, consumers will connect simultaneously to three pairs of the
cluster, which I think will significantly reduce the need for message
redistribution and improve latency.
However, the redistribution mechanism will remain active and required in
certain situations:
- when an application fails to have network connectivity to all active nodes
- to drain messages from a pair that is scheduled for maintenance
- when operators want to move all messages of a queue to a single node
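To make the balancing idea concrete, here is a minimal sketch of spreading a set of consumers evenly across the pairs (purely illustrative; our actual connectivity logic is more involved):

```python
def assign_consumers(consumer_ids, pairs):
    """Round-robin consumers across broker pairs so each pair gets an even
    share; consumers connected locally need no redistribution."""
    assignment = {pair: [] for pair in pairs}
    for i, consumer in enumerate(sorted(consumer_ids)):
        pair = pairs[i % len(pairs)]
        assignment[pair].append(consumer)
    return assignment
```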
*I sincerely hope this gives you more details about the environment in which
this Artemis cluster is running and what we are trying to achieve.*
*In any case, feel free to ask if you want more details!*
> Cluster connection randomly fails and stop message redistribution
> -----------------------------------------------------------------
>
> Key: ARTEMIS-5086
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker, Clustering
> Affects Versions: 2.30.0, 2.35.0, 2.36.0
> Reporter: Jean-Pascal Briquet
> Priority: Major
> Attachments: address-settings.xml, cluster-connections-stop.log,
> pr21-broker.xml
>
>
> h4. Context
> In a cluster of 3 primary/backup pairs, it can happen that the cluster
> connection randomly fails and does not automatically recover.
> The frequency of the problem is random; it can happen once every few weeks.
> When cluster connectivity is degraded, it stops the message flow between
> brokers and interrupts message redistribution.
> Not all cluster nodes are necessarily affected: some may still maintain
> cluster connectivity, while others are partially affected, and some can lose
> all connectivity.
> There are no errors visible in the logs when the issue occurs.
> h4. Workaround
> An operator has to stop and start the cluster connection via JMX management.
> This means that message redistribution can be interrupted for a potentially
> long time until it is manually restarted.
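For reference, that restart can be scripted against the cluster-connection MBean that Artemis registers; a small helper to build its JMX object name (the broker and cluster-connection names below are placeholders, and invoking stop()/start() on it is left out):

```python
def cluster_connection_mbean(broker_name: str, cluster_name: str) -> str:
    """Build the JMX ObjectName of an Artemis cluster connection.

    The stop() and start() operations on this MBean are what the operator
    invokes to restore redistribution.
    """
    return (
        'org.apache.activemq.artemis:broker="%s",'
        'component=cluster-connections,name="%s"' % (broker_name, cluster_name)
    )
```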
> h4. How to recognize the problem
> The cluster-connections JMX panel indicates that:
> - the cluster connection is started
> - the topology is correct and contains all nodes (3 members, 6 nodes)
> - the nodes field is either empty, or contains only one entry (instead of two
> when everything works). In my opinion, this is the main indicator: when things
> work well, nodes should be equal to "members in topology - 1"
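That indicator can be turned into a simple automated check (a sketch; retrieving the attributes over JMX or Jolokia is left out):

```python
def cluster_connection_healthy(topology_members: int, connected_nodes: int) -> bool:
    """A cluster connection is healthy when it holds one bridge per remote
    member, i.e. nodes == members in topology - 1."""
    return connected_nodes == topology_members - 1
```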
> h4. Consequences
> - Messages are stuck in {{$.artemis.internal.sf.artemis-cluster.*}} queues
> until the cluster connection is restarted.
> - Messages are stuck in {{notif.*}} queues until the cluster connection is
> restarted.
> - Consumers are starved, as message redistribution is broken.
> h4. Potential trigger
> I have observed this issue several times over the past months, but
> unfortunately, I don't have a reproduction case.
> I would have preferred something more predictable, but it seems to be a
> random problem.
> When the issue occurred this week, I noticed a strange coincidence: we
> deployed a configuration change (addition of 10 new addresses) at the same
> time on two different clusters.
> Configuration refresh is enabled, and during the upgrade process, we touch
> the broker.xml to trigger the config reload (so 6 * 2 nodes had their
> configuration reloaded).
> On both clusters, one node had correct cluster connectivity (nodes=2), one
> node had only one connection (nodes=1), and one node had no connections at
> all (nodes=0).
> Maybe I'm wrong, but the fact that it happened on two clusters after the same
> operation makes me think there may be something related.
> Please note that most of the time the config reload works very well and does
> not impact cluster connections.
> h4. Investigation
> Since I don't have a clear reproduction scenario, I checked the code to
> understand when the {{ClusterConnectionImpl.getNodes()}} could return an
> empty list.
> It seems that nodes are not listed when:
> - the record list is empty, or
> - the record list has elements but the session is null, or
> - the record list has elements but the forward connection is null
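Paraphrasing that logic as a sketch (not the actual Artemis source; the record objects here are stand-ins for the internal bridge records):

```python
def list_nodes(records):
    """Mimic the described getNodes() behaviour: a record only contributes a
    node when it has both a session and a live forwarding connection."""
    nodes = {}
    for node_id, record in records.items():
        session = record.get("session")
        if session is None:
            continue  # record present, but session is null
        if session.get("forwarding_connection") is None:
            continue  # session present, but forward connection is null
        nodes[node_id] = record["address"]
    return nodes
```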
> During the last incident, we have enabled TRACE level on:
> * {{org.apache.activemq.artemis.core.server.cluster}}
> * {{org.apache.activemq.artemis.core.client}}
> When we performed the stop operation on the cluster connections, the traces
> indicated that:
> - the record list had two entries (2 bridges, which is good)
> - {{session}} had a value (not sure about {{sessionConsumer}})
> - the forward connection is the last element that could be null
> These stop traces are provided as an attachment, if you want to review them.
> Based on that, I believe the list was empty because the forward connection
> was null.
> {{getNodes()}} contains a specific null check for the forward connection, so
> it seems that this null state can occur occasionally. When could it happen?
> I would expect the bridge auto-reconnection logic to restore the connection,
> but it does not seem to detect it, as it never recovers.
> Sorry, it is a bit vague, but if you have tips for further investigation, I
> would be happy to try them and provide more information.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)