https://bz.apache.org/bugzilla/show_bug.cgi?id=68884

            Bug ID: 68884
           Summary: Delayed HTTP Traffic Processing After Mass Websocket
                    Disconnect/Reconnect
           Product: Tomcat 9
           Version: 9.0.75
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: WebSocket
          Assignee: dev@tomcat.apache.org
          Reporter: inconceiva...@gmail.com
  Target Milestone: -----

Apache Tomcat Bug Report

Delayed HTTP Traffic Processing After Mass Websocket Disconnect/Reconnect

Description:

A significant delay of 10+ minutes occurs in resuming normal HTTP traffic processing after a mass websocket disconnect/reconnect event. This issue arises when a network interruption or stop-the-world garbage collection event exceeds the maxIdleTimeout (35 seconds), leading to numerous websocket session closures.

With several thousand websocket sessions closing simultaneously, all available nio2 threads (maxThreads=50) become occupied with the closure process. These threads enter a continuous loop, repeatedly calling Thread.yield while waiting to acquire the WsRemoteEndpointImplBase messagePartInProgress semaphore. This behavior, introduced as part of the fix for BZ66508, allows closing threads to relinquish CPU time while waiting for the send semaphore (up to the default 20-second timeout).

    java.base@11.0.21/java.lang.Thread.yield(Native Method)
    org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.acquireMessagePartInProgressSemaphore(WsRemoteEndpointImplServer.java:130)
    org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:292)
    org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:256)
    org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:801)
    org.apache.tomcat.websocket.WsSession.onClose(WsSession.java:711)

Observations indicate that on Linux, Thread.yield places the thread at a lower priority in the CPU scheduling queue, resulting in a prolonged series of yield calls until the timeout is reached and a SocketTimeoutException is triggered. HTTP traffic processing remains stalled until all session closures are completed.

We have implemented a temporary solution by introducing a property to limit the time spent in the on-close yield loop. Reducing this value from the default significantly improves recovery time. Additionally, decreasing maxThreads appears to further extend the recovery time, although the exact relationship requires further investigation.

Reproducing the Issue:

The issue, initially identified in a scenario with 50 threads and 5000 maximum websocket connections, can also be reproduced at a smaller scale with varying thread and session counts.

1. Establish several thousand websocket connections that periodically send/receive data to simulate traffic.
2. Induce a JVM pause or network interruption lasting 40 seconds or more.
3. Restore client-side connectivity.
4. Start a timer and attempt to obtain a 200 response from the server.
5. Stop the timer once a successful response is received.

Test Configurations and Results:

5 nio2 threads, 300 websocket connections:

    Close Timeout    Recovery Times (seconds)
    10s              218, 300, 159, 168, 312
    5s               60, 42, 102, 199, 160
    2s               27, 30, 42, 19, 18
    1s               13, 15, 15

15 nio2 threads, 300 websocket connections:

    Close Timeout    Recovery Times (seconds)
    2s               11, 8, 7, 6, 7, 12

Observations:

The issue was initially observed with Tomcat 9.0.75 (embedded) and remains reproducible with versions up to 9.0.82 (embedded), even with the 9.0.86 fix for reentrant lock on close handling applied. While the 9.0.86 fix resolved a memory leak, it did not alleviate the extended recovery times.
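
For anyone repeating the measurement, the timing in steps 4 and 5 of the reproduction could be automated roughly as follows. This is a minimal sketch, not the harness used for the numbers above: the class name RecoveryTimer, the default URL, the 500 ms polling interval, the 2-second request timeout and the 30-minute limit are all assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.function.IntSupplier;

// Hypothetical helper; not part of Tomcat or of the reporter's test setup.
final class RecoveryTimer {

    /**
     * Calls the probe until it returns 200 or the limit elapses.
     * Returns the elapsed time on success, or empty on give-up or interrupt.
     */
    static Optional<Duration> timeUntilOk(IntSupplier probe, Duration interval, Duration limit) {
        long start = System.nanoTime();
        long deadline = start + limit.toNanos();
        while (true) {
            if (probe.getAsInt() == 200) {
                return Optional.of(Duration.ofNanos(System.nanoTime() - start));
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    /** Issues one GET; returns the status code, or -1 if the request fails or times out. */
    static int httpStatus(HttpClient client, URI uri) {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(2)) // assumed per-request limit
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
        } catch (IOException e) {
            return -1; // a stalled server surfaces here as a timeout or connect failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Assumed default endpoint; pass the real URL as the first argument.
        URI uri = URI.create(args.length > 0 ? args[0] : "http://localhost:8080/");
        HttpClient client = HttpClient.newHttpClient();
        Optional<Duration> elapsed = timeUntilOk(
                () -> httpStatus(client, uri), Duration.ofMillis(500), Duration.ofMinutes(30));
        System.out.println(elapsed
                .map(d -> "recovered after " + d.toSeconds() + " s")
                .orElse("no 200 response within the limit"));
    }
}
```

Start the program at the moment client-side connectivity is restored. Passing the probe as an IntSupplier keeps the timing loop independent of the HTTP client, so it can be exercised without a running server.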

Proposed Solution:

Introducing a separate property specifically for the on-close send timeout would allow for finer-grained control and optimization of session closure behavior, particularly for servers operating with fixed thread pool sizes.

Additional Notes:

While BZ66508 removed the fixed timeout for on-close acquisition, the potential for a 20-second wait during semaphore acquisition persists, leading to prolonged session closure times and increased overhead on the OS scheduler due to the repeated yield calls.

We are investigating the precise relationship between thread count and recovery time and will provide additional data as it becomes available.

We believe that implementing the proposed solution would significantly improve Tomcat's performance under these conditions and provide administrators with greater control over resource utilization during mass websocket disconnect events.
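
To illustrate why the timeout value matters, the following is a simplified model of a yield-based wait with a configurable limit. It is not Tomcat's code: the class CloseWaitSketch, the method acquireWithYield and the property name sketch.closeSendTimeoutMs are hypothetical, and a semaphore with zero permits stands in for a send that never completes.

```java
import java.util.concurrent.Semaphore;

// Simplified model of the wait described in this report; NOT Tomcat's actual code.
final class CloseWaitSketch {

    /**
     * Spins with Thread.yield() until a permit is acquired or timeoutMs elapses.
     * Returns true if the permit was acquired, false on timeout.
     */
    static boolean acquireWithYield(Semaphore sendPermit, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!sendPermit.tryAcquire()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // a caller would treat this as a send timeout
            }
            // The thread stays runnable here, so its pool slot remains occupied
            // for the whole wait.
            Thread.yield();
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical property name, used only by this sketch. The 200 ms fallback
        // keeps the demo short; the report cites a 20-second default in Tomcat.
        long configuredMs = Long.getLong("sketch.closeSendTimeoutMs", 200L);
        Semaphore busy = new Semaphore(0); // models a send that never completes

        for (long timeoutMs : new long[] {configuredMs, configuredMs / 4}) {
            long start = System.nanoTime();
            boolean acquired = acquireWithYield(busy, timeoutMs);
            long heldMs = (System.nanoTime() - start) / 1_000_000L;
            System.out.println("timeout=" + timeoutMs + " ms, acquired=" + acquired
                    + ", thread held for about " + heldMs + " ms");
        }
    }
}
```

In this model each closing session holds a pool thread for the full timeout whenever the permit never becomes available. That is consistent with the shorter recovery times measured at lower close timeouts, though the model ignores scheduler effects and lock handling.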