[jira] [Created] (ARTEMIS-5325) Artemis dead-locked in the final phase of primary/backup initial replication

Jean-Pascal Briquet (Jira) Mon, 24 Feb 2025 06:41:04 -0800

Jean-Pascal Briquet created ARTEMIS-5325:
--------------------------------------------

Summary: Artemis dead-locked in the final phase of primary/backup
initial replication
Key: ARTEMIS-5325
URL: https://issues.apache.org/jira/browse/ARTEMIS-5325
Project: ActiveMQ Artemis
Issue Type: Bug
Components: Broker, Clustering
Affects Versions: 2.39.0, 2.38.0, 2.37.0, 2.36.0
Reporter: Jean-Pascal Briquet
Attachments: PrimaryDeadLockOnBackupSyncTest.java, thread-dump.txt

h2. Configuration

Artemis cluster with three primary/backup pairs using a ZooKeeper quorum.
h2. Description

The initial primary/backup replication can impact the primary (live) node,
causing it to crash or freeze for an extended period.

After an in-depth investigation, I found that the primary becomes dead-locked
because no Netty threads are available to process the replication
synchronization confirmation coming from the backup.
This issue occurs when client applications create too many connections during
the final phase of the initial replication.

Below, I provide details of my investigation and a potential workaround.
A thread-dump and a test-case are attached.
h3. Lock / Unlock

At the very end of the replication process, the Artemis primary locks its
internal state, including the journal (see
ReplicationManager.sendSynchronizationDone()).
It then waits for a synchronization confirmation packet from the backup before
releasing the lock (see ReplicationManager.handlePacket()).
This confirmation packet indicates to the primary that the backup is
synchronized and ready for duty.
While locked, the primary is essentially frozen: no operation can proceed on
the broker.
Under normal circumstances, this lock lasts only a few seconds or less.

However, in my scenario, the confirmation packet from the backup is never
processed.
As a result, the primary remains locked indefinitely, freezing all activity
until the replication process times out or the Artemis critical analyzer
decides to stop the process.
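
The lock-then-wait pattern described above can be illustrated with a minimal
Java sketch. This is not the Artemis code: the class, field and method names
(SyncDoneSketch, stateLock, onConfirmation, finishSync) are hypothetical
stand-ins, and a CountDownLatch plays the role of the confirmation packet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SyncDoneSketch {
    // Hypothetical stand-in for the lock protecting the broker state and journal.
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    // Opened once the (simulated) confirmation packet from the backup is handled.
    private final CountDownLatch confirmation = new CountDownLatch(1);

    /** Called by whichever thread handles the backup's confirmation packet. */
    public void onConfirmation() {
        confirmation.countDown();
    }

    /**
     * Holds the write lock until the confirmation arrives or the timeout expires.
     * Returns true if the confirmation was seen in time.
     */
    public boolean finishSync(long timeoutMillis) throws InterruptedException {
        stateLock.writeLock().lock();
        try {
            return confirmation.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            stateLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        // Normal case: another thread delivers the confirmation promptly.
        SyncDoneSketch normal = new SyncDoneSketch();
        new Thread(normal::onConfirmation).start();
        System.out.println("confirmed=" + normal.finishSync(2000));

        // Failure case: nobody ever handles the confirmation, so the lock is
        // held for the whole timeout.
        SyncDoneSketch stuck = new SyncDoneSketch();
        System.out.println("confirmed=" + stuck.finishSync(200));
    }
}
```

In this sketch the second call only returns once the timeout expires, which
mirrors the primary staying frozen until the replication timeout fires.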
h3. Confirmation packet handling issue

All incoming packets arriving at Artemis are handled by Netty threads, which
are managed by a dedicated Netty thread pool of size = 3 * processor count.
After adding low-level logging to the packet handlers and analyzing TCP dumps,
I am sure that the confirmation packet is received by the primary but is never
processed.
The thread dump shows that no free Artemis Netty threads are available.
All Netty threads are blocked handling connection creation requests while
attempting to send session notification events to other cluster nodes.
However, these notification events cannot be sent because of the replication
and journal lock.

During the investigation, I saw that some client applications were
misbehaving, aggressively creating new connections.
When these excessive connection requests occur in the final phase of the
initial replication, they can block all Netty threads, leading to the deadlock.
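
The starvation mechanism can be reproduced in isolation with a simplified
model. The sketch below is an assumption-laden illustration, not the attached
test and not Artemis code: a plain fixed-size ExecutorService stands in for the
Netty thread pool, a ReentrantLock stands in for the replication/journal lock,
and all names (PoolStarvationSketch, confirmationProcessed) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class PoolStarvationSketch {

    /**
     * Returns true if the simulated confirmation task ran before the timeout.
     * poolSize models the Netty pool size; connectionRequests models how many
     * connection creation requests arrive while the lock is held.
     */
    public static boolean confirmationProcessed(int poolSize, int connectionRequests,
                                                long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        ReentrantLock stateLock = new ReentrantLock();
        CountDownLatch confirmation = new CountDownLatch(1);

        stateLock.lock(); // the "primary" enters the final synchronization phase
        try {
            // Each connection request needs the lock to send its notification,
            // so it blocks a pool thread for as long as the lock is held.
            for (int i = 0; i < connectionRequests; i++) {
                pool.execute(() -> {
                    stateLock.lock();
                    stateLock.unlock();
                });
            }
            // The confirmation is submitted to the same pool, behind those requests.
            pool.execute(confirmation::countDown);
            return confirmation.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            stateLock.unlock(); // timeout or success: release so workers can drain
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int poolSize = 3 * Runtime.getRuntime().availableProcessors();
        // One thread left free: the confirmation is processed.
        System.out.println("spare thread: "
                + confirmationProcessed(poolSize, poolSize - 1, 500));
        // Every thread blocked on the lock: the confirmation is never picked up in time.
        System.out.println("pool exhausted: "
                + confirmationProcessed(poolSize, poolSize, 500));
    }
}
```

The real Netty threading model is more involved than a shared work queue, so
this only shows the general shape of the problem: once every worker is waiting
on the lock, the one task that would release it can no longer run.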
h2. Workaround

Enable the following configuration in the broker.xml.
{quote}<suppress-session-notifications>true</suppress-session-notifications>
{quote}
This property disables session creation notifications, preventing Netty threads
from being blocked and therefore avoiding the deadlock.
https://activemq.apache.org/components/artemis/documentation/latest/management.html#suppressing-session-notifications

Disabling session notifications seems to be acceptable for my use cases, which
rely on the CORE, AMQP and OPENWIRE protocols.
However, according to the documentation, this option should not be used with
the MQTT protocol.
h2. Test

Add the provided test under
tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

