https://bz.apache.org/bugzilla/show_bug.cgi?id=64848
Bug ID: 64848
Summary: WsSession objects in OUTPUT_CLOSED state are implicitly held by waitingProcessors and GC cannot purge them from the JVM heap
Product: Tomcat 9
Version: 9.0.36
Hardware: PC
Status: NEW
Severity: major
Priority: P4
Component: WebSocket
Assignee: dev@tomcat.apache.org
Reporter: laszlo.peter.karo...@gmail.com
Target Milestone: -----

Created attachment 37534
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37534&action=edit
Attachment with the 3 services that help reproduce the issue + SocketTimeoutException stack trace and WsSession GC root snapshot from a heap dump

Overview:
---------
WebSocket session objects (represented as WsSession) "get stuck" on the heap under heavy load when Tomcat acts as a WebSocket API gateway between a client and a server application.

Consider a system configuration where the Tomcat WebSocket API gateway service is deployed on the same machine as the server application while the client is launched elsewhere. Network latency then adds considerable overhead to the client-side data processing, whereas on the server side the data can be generated and transferred to the gateway service much faster than the client can consume it. This leads to the classic fast producer-slow consumer phenomenon.

If the network latency goes beyond a certain threshold and the client cannot keep up with the data flow coming from the gateway service, Tomcat starts throwing SocketTimeoutException (by default after 20 seconds) for the WebSocket sessions whose data was not transmitted in time. Such a timeout may end in an abnormally closed WebSocket connection (usually reported with status code 1006 at the client side). Even though the corresponding sessions are moved into the OUTPUT_CLOSED state at Tomcat level, they are kept on the JVM heap indefinitely, which prevents the GC from purging them and produces a slow memory leak.
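
The fast producer-slow consumer situation described above can be illustrated without Tomcat at all. The following is only a minimal sketch using plain java.util.concurrent: the class name SlowConsumerSketch, the bounded buffer and all parameter values are assumptions made up for illustration, and a failed timed offer merely stands in for the blocking send timeout. It is not how Tomcat implements WebSocket writes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the gateway-to-client path: a bounded buffer
// plays the role of the socket send buffer, a timed offer() plays the role
// of a blocking send with a timeout.
public class SlowConsumerSketch {

    /**
     * Returns how many messages were handed over before the first send
     * timed out, or 'messages' if every send completed in time.
     */
    static int run(int messages, int bufferSize, long sendTimeoutMs, long consumerDelayMs) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(bufferSize);

        // The "client": takes one message, then needs consumerDelayMs to process it.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    buffer.take();
                    Thread.sleep(consumerDelayMs);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // asked to stop
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        int sent = 0;
        try {
            // The "server/gateway": produces as fast as it can.
            for (int i = 0; i < messages; i++) {
                boolean accepted = buffer.offer("payload-" + i, sendTimeoutMs, TimeUnit.MILLISECONDS);
                if (!accepted) {
                    break; // analogous to the send timing out
                }
                sent++;
            }
            consumer.interrupt();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sending", e);
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("slow consumer: sent " + run(20, 2, 50, 500) + " of 20");
        System.out.println("fast consumer: sent " + run(20, 2, 5000, 0) + " of 20");
    }
}
```

With a consumer that is much slower than the send timeout, the producer stops early; with a consumer that keeps up, all messages go through. In the reported scenario the early stop corresponds to the SocketTimeoutException and the subsequent abnormal close.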
Steps to Reproduce:
-------------------
Reproducing this situation is somewhat cumbersome in terms of the required hardware configuration, as it needs two machines/VMs: one for the client and another for the Tomcat WebSocket API gateway + server application. In addition, the client app should be hosted "far" from the Tomcat WebSocket app in terms of network distance; e.g. if it connects to the Tomcat WebSocket app via VPN, the network latency can be enough to reproduce the issue. Alternatively, in the Tomcat WebSocket app the org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT property can be adjusted as part of a customized RequestUpgradeStrategy to simulate a slow network.

The overall system is a simple distributed client-server application with the Tomcat WebSocket API GW inserted as an intermediary. The client sends the server a number that denotes a length (given in KBs), and the server responds with a random alphanumeric string of that length. The Tomcat WebSocket API GW just routes the WS traffic back and forth between the other two services.

1. Run the attached random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies application (the server app) in the form of

   java -jar random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies <host> <port>

   By default it configures the underlying Undertow web server to launch on localhost and listen on port #8193.

2. Run the attached ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar application (the Tomcat WebSocket API GW app) in the form of

   java -jar ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

   By default it listens on port #8444, which can be overridden by setting the server.port property.
   If the Undertow server app runs with a non-default host and port configuration, this needs to be reflected here by specifying the zuul.routes.random-string-websocket-provider.url property accordingly, e.g.:

   java -jar -Dzuul.routes.random-string-websocket-provider.url=http://<another-host>:<another-port> ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

3. Run the attached ws-random-string-gatling-load-test application (the client app wrapped in Gatling to generate artificial load) in the form of

   mvn clean -B compile exec:java -Dexec.mainClass=RandomStringWebSocketRequestApp -DrampUpUsers=<number-of-concurrent-users> -DrampUpTime=1 -DserverUrl=<host-where-other-two-services-run> -DserverPort=<port-where-Tomcat-API-GW-listens-on> -Dgatling.simulationClass=com.acme.wsrequest.simulation.RandomStringWebSocketRequestSimulation -DrandomStringLengthInKb=1000

Actual Results:
---------------
Running the client app with 400 users reliably produces SocketTimeoutException in the Tomcat WebSocket API gateway service. At the client side the Gatling report starts showing unexpectedly closed WS connections (with status code 1006), and the number of such connections appears to correspond closely to the number of "stuck" WsSession objects on the Tomcat WebSocket app's heap. Those WsSession objects are retained indefinitely and hence cannot be garbage-collected.

Expected Results:
-----------------
WsSession objects representing abnormally closed WebSocket connections shall eventually become eligible for garbage collection on the JVM heap.

Build Date & Hardware:
----------------------
Build 2020-10-26 on Windows Server 2016 Standard (Version 1607 - OS Build 14393.3930)

Additional Builds and Platforms:
--------------------------------
N/A

Additional Information:
-----------------------
There is an attachment (tomcat-ws-api-gw-sockettimeoutexception-stack-trace.txt) that shows the stack trace produced when SocketTimeoutException is encountered.
Another attachment (tomcat-wssession-gc-root.png) contains the relevant part of the heap dump created after a 400-user Gatling load execution. Searching for "websocketsession" objects brings up the retained WebSocket session objects, and checking the GC root of such a session object shows the object reference chain up to the "waitingProcessors" map present in Http11NioProtocol.
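
The retention pattern visible in the heap dump (a closed session that is still reachable through a long-lived collection) can be sketched in isolation. Everything below is hypothetical and only loosely modeled on the description above: RetainedSessionSketch, FakeSession and Registry are made-up names, and the map merely plays the role that the report attributes to "waitingProcessors". It does not reproduce Tomcat's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration: an object marked as closed stays strongly
// reachable (and so cannot be garbage-collected) for as long as a
// long-lived map still holds a reference to it.
public class RetainedSessionSketch {

    enum State { OPEN, OUTPUT_CLOSED }

    static final class FakeSession {
        final byte[] buffer = new byte[1024]; // memory pinned while reachable
        volatile State state = State.OPEN;
    }

    static final class Registry {
        // Long-lived owner of the references, comparable to a GC root path.
        private final Map<Integer, FakeSession> waiting = new ConcurrentHashMap<>();

        FakeSession open(int id) {
            FakeSession session = new FakeSession();
            waiting.put(id, session);
            return session;
        }

        // Leaky path: the state changes, but the map entry is never removed.
        void closeWithoutCleanup(int id) {
            FakeSession session = waiting.get(id);
            if (session != null) {
                session.state = State.OUTPUT_CLOSED;
            }
        }

        // Clean path: dropping the entry makes the session collectable
        // once no other reference to it remains.
        void closeWithCleanup(int id) {
            FakeSession session = waiting.remove(id);
            if (session != null) {
                session.state = State.OUTPUT_CLOSED;
            }
        }

        long retainedClosed() {
            return waiting.values().stream()
                    .filter(s -> s.state == State.OUTPUT_CLOSED)
                    .count();
        }

        int size() {
            return waiting.size();
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        for (int i = 0; i < 5; i++) {
            registry.open(i);
        }
        registry.closeWithoutCleanup(0);
        registry.closeWithoutCleanup(1);
        registry.closeWithCleanup(2);
        System.out.println("entries still held: " + registry.size());
        System.out.println("closed but retained: " + registry.retainedClosed());
    }
}
```

In a heap dump, the sessions counted by retainedClosed() would show a reference chain ending at the registry map, which is the same shape as the GC root path described for the stuck WsSession objects.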