blackb1rd opened a new pull request, #1965:
URL: https://github.com/apache/activemq/pull/1965

   ### **Proposed PR Description**
   
   **Title:** Fix race condition causing `InvalidClientIDException` on rapid 
client reconnects
   
   **Summary**
   This PR resolves a Time-of-Check to Time-of-Use (TOCTOU) race condition in 
`RegionBroker.addConnection()` that occurs when a client rapidly reconnects 
after a network drop. Previously, this race condition resulted in an 
unwarranted `InvalidClientIDException`.
   
   **The Problem & Root Cause**
   When a client connection drops and immediately reconnects, the following 
sequence occurs:
   1. The network drops, triggering `TransportConnection.stopAsync()` on the 
old connection. This immediately sets `stopping = true`.
   2. The actual cleanup task (`processRemoveConnection`, which calls 
`broker.removeConnection()`) is scheduled to run asynchronously on a separate 
thread.
   3. If the client reconnects before that asynchronous cleanup completes, 
`RegionBroker.addConnection()` finds a stale entry for the client in the 
`clientIdSet`.
   4. Because `isAllowLinkStealing()` defaults to `false` for TCP/OpenWire, the 
broker rejects the perfectly valid reconnect attempt and throws an 
`InvalidClientIDException`. 
   
   **The Fix**
   Modified `RegionBroker.addConnection()` to inspect the state of the existing 
connection before throwing the exception. 
   
   Inside the `synchronized (clientIdSet)` block, we now check if the existing 
connection has `isStopping() == true`. If it is already in the process of 
stopping, we allow the new connection to proceed—effectively treating it the 
same as link-stealing, since the old connection is already dead and just 
awaiting garbage collection.
   
   **Safety & Side-Effect Analysis**
   This change is clean and introduces no regression risks for the following 
reasons:
   * **Idempotency:** The subsequent `stopAsync()` call on the old connection 
is completely harmless. It uses `compareAndSet(false, true)`, meaning calling 
it on an already-stopping connection is a safe no-op.
   * **Cleanup Thread Safety:** The asynchronous `removeConnection()` cleanup 
relies on a guard (`oldValue == context`). When the delayed task finally 
executes, it will not accidentally remove the newly registered connection 
context. 
   
   ---
   *Fixes #[Insert Issue Number]*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact


Reply via email to