https://bz.apache.org/bugzilla/show_bug.cgi?id=64848

            Bug ID: 64848
           Summary: WsSession objects in OUTPUT_CLOSED state are
                    implicitly held by waitingProcessors and GC cannot
                    purge them from the JVM heap
           Product: Tomcat 9
           Version: 9.0.36
          Hardware: PC
            Status: NEW
          Severity: major
          Priority: P4
         Component: WebSocket
          Assignee: dev@tomcat.apache.org
          Reporter: laszlo.peter.karo...@gmail.com
  Target Milestone: -----

Created attachment 37534
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37534&action=edit
Attachment with the 3 services that help reproduce the issue +
SocketTimeoutException stack trace and WsSession GC root snapshot from a heap
dump

Overview:
---------

WebSocket session objects (represented as WsSession) "get stuck" on the heap
under heavy load when Tomcat acts as a WebSocket API gateway between a
client and a server application.
Assume a system configuration where the Tomcat WebSocket API gateway
service is deployed on the same machine as the server application while the
client is launched elsewhere. Network latency can then impose a considerable
overhead on client-side data processing, whereas the server side can generate
data and transfer it to the gateway service much faster than the client can
consume it. This leads to the classic fast producer-slow consumer phenomenon.
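
The fast producer-slow consumer effect can be illustrated with a minimal
sketch. This is a hypothetical model and not Tomcat code: a bounded queue
stands in for the send buffer towards a slow client, and the class and method
names are made up for illustration only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model, not Tomcat code: a bounded buffer stands in for the
// send buffer between the gateway and a slow client.
class SlowConsumerSketch {

    // Tries to hand each message to the buffer, waiting at most timeoutMillis
    // per message. Returns how many messages were accepted before the first
    // timeout (or all of them if no timeout occurs).
    static int sendAll(BlockingQueue<String> buffer, int messages, long timeoutMillis) {
        for (int i = 0; i < messages; i++) {
            try {
                boolean accepted =
                        buffer.offer("msg-" + i, timeoutMillis, TimeUnit.MILLISECONDS);
                if (!accepted) {
                    return i; // analogous to a blocking send timing out
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return i;
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        // Capacity 2 and no consumer at all: the third offer times out.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(2);
        int accepted = sendAll(buffer, 5, 50);
        System.out.println("accepted before timeout: " + accepted);
    }
}
```

With no consumer draining the queue, the producer fills the buffer and the
next send attempt times out, which is the same shape of failure as the
blocking send timeout described below.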

If the network latency goes beyond a certain threshold and the client cannot
keep up with the data flow coming from the gateway service, Tomcat starts
throwing SocketTimeoutException (by default after 20 seconds) for the
WebSocket sessions whose data was not transmitted in time. Such a timeout may
end in an abnormally closed WebSocket connection (usually reported with status
code 1006 on the client side). Even though the corresponding sessions are
moved into the OUTPUT_CLOSED state at the Tomcat level, they remain on the JVM
heap indefinitely: the GC cannot purge them, which produces a slow memory
leak.

Steps to Reproduce:
-------------------

Reproducing the situation is a bit cumbersome in terms of the required
hardware configuration as it needs two machines/VMs: one for the client and
another for the Tomcat WebSocket API gateway + server application. In addition,
the client app should be hosted "far" from the Tomcat WebSocket app in terms
of network distance; e.g. if it connects to the Tomcat WebSocket app via VPN,
the network latency can be enough to reproduce the issue.
Alternatively, in the Tomcat WebSocket app the
org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT property can be adjusted
as part of a customized RequestUpgradeStrategy to simulate a slow network.

The overall system is a simple distributed client-server application with the
Tomcat WebSocket API GW inserted as an intermediary. The client sends a number
to the server that denotes a length (given in KBs), and the server responds
with a random alphanumeric string of that length. The Tomcat WebSocket API GW
just routes the WS traffic back and forth between the other two services.
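
For the org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT alternative, a
minimal sketch follows. It assumes the timeout is supplied as a Long in
milliseconds through the WebSocket session's user properties map, which is how
Tomcat's WebSocket documentation describes it; the class name, method name and
the 2000 ms value are made up for illustration. A plain HashMap stands in for
session.getUserProperties() so the sketch runs without a container.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tomcat or the attached services.
class BlockingSendTimeoutSketch {

    // Property name as quoted in this report.
    static final String BLOCKING_SEND_TIMEOUT =
            "org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT";

    // Stores the timeout (milliseconds, as a Long) in a user-properties map.
    // Assumption: in a real endpoint the map is session.getUserProperties().
    static void setBlockingSendTimeout(Map<String, Object> userProperties, long millis) {
        userProperties.put(BLOCKING_SEND_TIMEOUT, Long.valueOf(millis));
    }

    public static void main(String[] args) {
        // Stand-in for the session's user properties map.
        Map<String, Object> userProperties = new HashMap<>();
        // A value well below the 20-second default makes timeouts easier to hit.
        setBlockingSendTimeout(userProperties, 2_000L);
        System.out.println(userProperties);
    }
}
```

Lowering the value this way is only a means of provoking the timeout on a
fast network; it does not change how the session is handled afterwards.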

1. Run the attached
random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies
application (the server app) as follows:

java -jar random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies <host> <port>

By default it configures the underlying Undertow web server to launch on
localhost and listen on port 8193.

2. Run the attached ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar application (the
Tomcat WebSocket API GW app) as follows:

java -jar ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

By default it listens on port 8444, which can be overridden by setting the
server.port property.
If the Undertow server app runs with a non-default host or port, this needs to
be reflected here by specifying the
zuul.routes.random-string-websocket-provider.url property accordingly, e.g.:

java -jar -Dzuul.routes.random-string-websocket-provider.url=http://<another-host>:<another-port> ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

3. Run the attached ws-random-string-gatling-load-test application (the client
app wrapped in Gatling to generate artificial load) as follows:

mvn clean -B compile exec:java -Dexec.mainClass=RandomStringWebSocketRequestApp
-DrampUpUsers=<number-of-concurrent-users> -DrampUpTime=1
-DserverUrl=<host-where-other-two-services-run>
-DserverPort=<port-where-Tomcat-API-GW-listens-on>
-Dgatling.simulationClass=com.acme.wsrequest.simulation.RandomStringWebSocketRequestSimulation
-DrandomStringLengthInKb=1000

Actual Results:
---------------

Running the client app with 400 users reliably produces
SocketTimeoutException in the Tomcat WebSocket API gateway service.
On the client side the Gatling report starts showing unexpectedly closed WS
connections (with status code 1006), and the number of such connections
appears to correspond closely to the number of "stuck" WsSession objects on
the Tomcat WebSocket app's heap. Those WsSession objects are retained
indefinitely and hence cannot be garbage-collected.

Expected Results:
-----------------

WsSession objects representing abnormally closed WebSocket connections should
eventually become eligible for garbage collection.

Build Date & Hardware:
----------------------

Build 2020-10-26 on Windows Server 2016 Standard (Version 1607 - OS Build
14393.3930)

Additional Builds and Platforms:
--------------------------------

N/A

Additional Information:
-----------------------

There is an attachment
(tomcat-ws-api-gw-sockettimeoutexception-stack-trace.txt) to show the stack
trace produced when SocketTimeoutException is encountered.

Another attachment (tomcat-wssession-gc-root.png) contains the relevant part
of the heap dump created after a 400-user Gatling load execution. Searching
for "websocketsession" objects brings up the retained WebSocket session
objects, and checking the GC root of such a session object shows the object
reference chain up to the "waitingProcessors" map present in
Http11NioProtocol.
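
The retention pattern suggested by that reference chain can be modelled with a
minimal sketch. This is a simplified, hypothetical model and not Tomcat's
actual code: the classes below only borrow the names "waitingProcessors" and
OUTPUT_CLOSED from this report to show that an object stays strongly reachable
for as long as a long-lived collection still references it, regardless of the
state the object itself is in.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the retention pattern, not Tomcat internals.
class WaitingProcessorsSketch {

    enum State { OPEN, OUTPUT_CLOSED }

    // Stand-in for a WebSocket session holding some memory.
    static final class FakeSession {
        State state = State.OPEN;
        final byte[] payload = new byte[1024];
    }

    // Stand-in for a processor that references its session.
    static final class FakeProcessor {
        final FakeSession session;
        FakeProcessor(FakeSession session) { this.session = session; }
    }

    // Long-lived collection owned by the (model) protocol handler.
    private final Set<FakeProcessor> waitingProcessors = ConcurrentHashMap.newKeySet();

    FakeProcessor register(FakeSession session) {
        FakeProcessor processor = new FakeProcessor(session);
        waitingProcessors.add(processor);
        return processor;
    }

    // Leaky path: the session changes state but the processor stays registered,
    // so the session remains reachable from the collection.
    void closeOutputOnly(FakeProcessor processor) {
        processor.session.state = State.OUTPUT_CLOSED;
    }

    // Non-leaky path: the processor is also removed, so the session can become
    // unreachable once no other references exist.
    void closeAndRelease(FakeProcessor processor) {
        processor.session.state = State.OUTPUT_CLOSED;
        waitingProcessors.remove(processor);
    }

    int retained() {
        return waitingProcessors.size();
    }

    public static void main(String[] args) {
        WaitingProcessorsSketch handler = new WaitingProcessorsSketch();
        FakeProcessor leaked = handler.register(new FakeSession());
        FakeProcessor released = handler.register(new FakeSession());
        handler.closeOutputOnly(leaked);
        handler.closeAndRelease(released);
        System.out.println("still retained: " + handler.retained());
    }
}
```

In this model only the second close path lets the session become unreachable;
whether and where Tomcat should perform the equivalent removal is exactly what
this report asks to be investigated.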

