blackb1rd opened a new pull request, #1965: URL: https://github.com/apache/activemq/pull/1965
### **Proposed PR Description** **Title:** Fix race condition causing `InvalidClientIDException` on rapid client reconnects **Summary** This PR resolves a Time-of-Check to Time-of-Use (TOCTOU) race condition in `RegionBroker.addConnection()` that occurs when a client rapidly reconnects after a network drop. Previously, this race condition resulted in an unwarranted `InvalidClientIDException`. **The Problem & Root Cause** When a client connection drops and immediately reconnects, the following sequence occurs: 1. The network drops, triggering `TransportConnection.stopAsync()` on the old connection. This immediately sets `stopping = true`. 2. The actual cleanup task (`processRemoveConnection`, which calls `broker.removeConnection()`) is scheduled to run asynchronously on a separate thread. 3. If the client reconnects before that asynchronous cleanup completes, `RegionBroker.addConnection()` finds a stale entry for the client in the `clientIdSet`. 4. Because `isAllowLinkStealing()` defaults to `false` for TCP/OpenWire, the broker rejects the perfectly valid reconnect attempt and throws an `InvalidClientIDException`. **The Fix** Modified `RegionBroker.addConnection()` to inspect the state of the existing connection before throwing the exception. Inside the `synchronized (clientIdSet)` block, we now check if the existing connection has `isStopping() == true`. If it is already in the process of stopping, we allow the new connection to proceed—effectively treating it the same as link-stealing, since the old connection is already dead and just awaiting garbage collection. **Safety & Side-Effect Analysis** This change is clean and introduces no regression risks for the following reasons: * **Idempotency:** The subsequent `stopAsync()` call on the old connection is completely harmless. It uses `compareAndSet(false, true)`, meaning calling it on an already-stopping connection is a safe no-op. * **Cleanup Thread Safety:** The asynchronous `removeConnection()` cleanup relies on a guard (`oldValue == context`). When the delayed task finally executes, it will not accidentally remove the newly registered connection context. --- *Fixes #[Insert Issue Number]* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected] For further information, visit: https://activemq.apache.org/contact
