[
https://issues.apache.org/jira/browse/ARTEMIS-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Pascal Briquet updated ARTEMIS-5086:
-----------------------------------------
Description:
h3. *Context*
In a cluster of 3 primary/backup pairs, the cluster connection can randomly
fail and not recover automatically.
The frequency of the problem is random; it can happen as rarely as once every few weeks.
When cluster connectivity is degraded, the message flow between brokers stops
and message redistribution is interrupted.
Not all cluster nodes are necessarily affected: some may still maintain
cluster connectivity, others are partially affected, and some lose all
connectivity.
No errors are visible in the logs when the issue occurs.
h3. *Workaround*
+Disable auto-deletion+
Set {{config-delete-addresses}} and {{config-delete-queues}} to OFF in the
address-settings configuration.
Remove unneeded queues via JMX or through the administration console until a
fix is available.
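For example, keeping the same match as the address-settings used in the reproduction configuration below (adjust the match to your own setup):
{code:xml}
<address-setting match="queue.#">
   <config-delete-addresses>OFF</config-delete-addresses>
   <config-delete-queues>OFF</config-delete-queues>
</address-setting>{code}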
+Restarting cluster-connections+
The flow can be restored if an operator stops and starts the cluster-connection
from JMX management.
This means that message redistribution can be interrupted for a potentially
long time until it is manually restarted.
h3. *How to recognize the problem*
The cluster-connections JMX panel indicates that:
- the cluster connection is started
- the topology is correct and contains all nodes (3 members, 6 nodes)
- the nodes field is either empty or contains only one entry (instead of two
when everything works). In my opinion this is the main indicator: when the
cluster is healthy, nodes should equal "members in topology - 1"
The following log line appears and is related to the deletion of the
cluster-connection queue during a configuration hot-reload:
{code:java}
AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
h3. *Consequences*
- Messages are stuck in {{$.artemis.internal.sf.artemis-cluster.*}} queues
until the cluster connection is restarted.
- Messages are stuck in {{notif.*}} queues until the cluster connection is
restarted.
- Consumers are starved because message redistribution is broken.
- Stuck messages are lost when the cluster-connection is restarted.
h3. *Reproduction scenarios*
+Configuration+
- Artemis cluster with node-1, node-2 and node-3
- Configuration reload enabled
{code:java}
<configuration-file-refresh-period>5000</configuration-file-refresh-period>{code}
- Address-settings containing automatically removable queue configuration
{code:java}
<address-setting match="queue.#">
<config-delete-addresses>FORCE</config-delete-addresses>
<config-delete-queues>FORCE</config-delete-queues>
</address-setting>{code}
- Addresses defined
{code:java}
<addresses xmlns="urn:activemq:core">
<address name="queue.A">
<anycast>
<queue name="queue.A"/>
</anycast>
</address>
<address name="queue.B">
<anycast>
<queue name="queue.B"/>
</anycast>
</address>
</addresses>{code}
*+Cluster-connection broken reproduction scenario:+*
* Start the Artemis cluster with node-1, node-2 and node-3
* Remove the "queue.B" address and its associated queue from the configuration file
* Touch broker.xml (if the configuration is managed externally) to trigger a reload
* Once the configuration is reloaded:
** the $.artemis.internal.sf queues are removed
** the cluster-connection bridges are disconnected and enter an inconsistent
state (refer to the investigation section below for details)
+Logs:+
{code:java}
2024-11-25 08:14:43,772 INFO [org.apache.activemq.artemis.core.server]
AMQ221056: Reloading configuration: addresses
2024-11-25 08:14:43,773 INFO [org.apache.activemq.artemis.core.server]
AMQ224077: Undeploying queue
$.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f
2024-11-25 08:14:43,786 INFO [org.apache.activemq.artemis.core.server]
AMQ224077: Undeploying queue
$.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015
2024-11-25 08:14:43,796 INFO [org.apache.activemq.artemis.core.server]
AMQ224077: Undeploying queue queue.B
2024-11-25 08:14:43,802 INFO [org.apache.activemq.artemis.core.server]
AMQ224076: Undeploying address queue.B{code}
+*Message loss reproduction scenario:*+
* Create a consumer connected to queue.A on node-2
* Follow the same steps as in the "Cluster-connection broken" case
* New messages sent to queue.A on node-1 are immediately redistributed and
acknowledged.
* These messages never arrive on node-2 and are not in node-1's local queues,
leading to message loss.
h3. *Root cause analysis*
The configuration reload erroneously removes the $.artemis.internal.sf queues
when an address is deleted from the configuration.
This triggers the disconnect() method on the cluster-connection bridge,
leaving it in an inconsistent state and causing message loss.
Call stack:
{code:java}
ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration
-> ActiveMQServerImpl.listQueues(address)
---> PostOfficeImpl.listQueuesForAddress{code}
listQueuesForAddress returns queues based on local bindings AND remote
bindings:
{code:java}
0 = {QueueImpl@10860}
"QueueImpl[name=$.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015,
postOffice=PostOfficeImpl
[server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@77dd8218"
1 = {QueueImpl@12358} "QueueImpl[name=queue.B, postOffice=PostOfficeImpl
[server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@7d207349"
2 = {QueueImpl@12359}
"QueueImpl[name=$.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f,
postOffice=PostOfficeImpl
[server=ActiveMQServerImpl::name=artemis-dc1-primary-1],
temp=false]@6e1ab32f"{code}
*Possible fixes:*
+Filter SNF queues+
Cluster SNF (store-and-forward) queues should be filtered out of the
listQueuesForAddress results. This would prevent the configuration reload
process from removing them.
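As a rough illustration only (not the actual Artemis code; the prefix is the one observed in the undeploy logs above), the filtering could exclude queues whose name starts with the store-and-forward prefix:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SnfQueueFilter {

   // Prefix of cluster store-and-forward queues, as seen in the undeploy logs.
   static final String SNF_PREFIX = "$.artemis.internal.sf.";

   // Keep only user queues so the reload undeploy step never considers SNF queues.
   public static List<String> filterUserQueues(List<String> queueNames) {
      return queueNames.stream()
            .filter(name -> !name.startsWith(SNF_PREFIX))
            .collect(Collectors.toList());
   }
}
```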
+Bridge robustness+
The disconnect() method leaves the bridge in an inconsistent state.
Ideally, the bridge state should be properly updated to stopped and a
reconnection attempt triggered to restore the bridge behaviour.
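A hypothetical sketch of what a consistent disconnect could look like (field and method names are invented for illustration, not the real BridgeImpl API):

```java
public class BridgeStateSketch {

   enum State { STARTED, STOPPED }

   State state = State.STARTED;
   boolean sessionOpen = true;
   boolean reconnectScheduled = false;

   // On a forced disconnect, reflect reality in the state instead of leaving
   // the bridge STARTED with a null session, and hand off to the retry logic.
   void disconnect() {
      sessionOpen = false;
      state = State.STOPPED;
      reconnectScheduled = true;
   }
}
```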
h3. *Investigation*
h4. Investigation (2024-10-03)
Since I don't have a clear reproduction scenario, I checked the code to
understand when the {{ClusterConnectionImpl.getNodes()}} could return an empty
list.
It seems that nodes are not listed when:
- the record list is empty, or
- the record list has elements but the session is null, or
- the record list has elements but the forward connection is null
During the last incident, we enabled TRACE level on:
* {{org.apache.activemq.artemis.core.server.cluster}}
* {{org.apache.activemq.artemis.core.client}}
When we performed the stop operation on the cluster-connections, the traces
indicated that:
- the record list had two entries (2 bridges, which is good)
- {{session}} had a value (not sure about {{sessionConsumer}})
- the forward connection is the only remaining element that could be null
These stop traces are provided as an attachment, if you want to review them.
Based on that, I believe the list was empty because the forward connection was
null.
{{getNodes}} contains a specific null check for the forward connection, so it
seems that this null state can occur occasionally. When could it happen?
I would expect the bridge auto-reconnection logic to restore the connection,
but it does not seem to detect the problem, as the bridge never recovers.
Sorry this is a bit vague, but if you have tips for further investigation, I
would be happy to try them and provide more information.
*Investigation Update (2024-11-21)*
The problem occurred again, and I now have several heap dumps of Artemis nodes.
From these heap dumps, I can see that:
+ClusterConnectionImpl state looks good:+
* ClusterConnectionImpl.status is started
* ClusterConnection.records contains 2 entries
+Record details state is valid:+
!image-2024-11-21-11-04-58-242.png|width=655,height=261!
+*Bridge state does not look good*: it is started but has no session+
!image-2024-11-21-11-08-16-869.png|width=1006,height=864!
The internal producer is marked as closed and its session is stopped.
Reviewing the Artemis code, I found that clearing the sessionConsumer and the
session without updating the state can be triggered by the
BridgeImpl.disconnect() method.
This method can be invoked following the deletion of the
{{$.artemis.internal.sf.artemis-cluster.}} queue performed in
ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration():
{code:java}
AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
I now suspect that the cluster-connection failure is triggered by an automatic
configuration refresh event.
Each configuration refresh seems to have a chance of deleting the
"$.artemis.internal" queues.
It is not systematic: over the past weeks we had 40 successful config refreshes
without the queues being removed.
Could the AddressSettings or AddressInfo be corrupted, causing these queues to
be flagged as removable within the
ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration() method?
*Investigation update (2024-11-25)*
Added a root cause analysis, reproduction scenario and workaround.
*Grafana visualisation of the depth of the notif.* queues when the incident occurred:*
* primary-1 had 0 cluster-connection nodes
* primary-2 had 2 cluster-connection nodes
* primary-3 had 1 cluster-connection node
!image-2024-10-08-14-26-51-937.png!
> Cluster connection randomly fails and stop message redistribution
> -----------------------------------------------------------------
>
> Key: ARTEMIS-5086
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker, Clustering
> Affects Versions: 2.30.0, 2.35.0, 2.36.0
> Reporter: Jean-Pascal Briquet
> Priority: Major
> Attachments: address-settings.xml, cluster-connections-stop.log,
> image-2024-10-08-14-26-51-937.png, image-2024-11-21-11-04-58-242.png,
> image-2024-11-21-11-08-16-869.png, message-events-during-incident-1.log,
> pr21-broker.xml
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact