https://bz.apache.org/bugzilla/show_bug.cgi?id=69982

            Bug ID: 69982
           Summary: TLS: large response write may stall after client TCP
                    window closes and only complete on keepAliveTimeout
           Product: Tomcat 11
           Version: 11.0.18
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: regression
          Priority: P2
         Component: Connectors
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: -------

I believe this is a regression in Tomcat 11’s TLS response write/flush
handling. After upgrading from Tomcat 10.1.52 to Tomcat 11.x (reproduced with
11.0.0 through 11.0.18), I intermittently see large HTTPS servlet responses
stall after most of the body has been sent: the client receives the initial
portion promptly, but the remaining bytes are not flushed until Tomcat’s
keepAliveTimeout expires, at which point the rest of the response is sent and
the connection is closed. I can reproduce this with both JSSE and OpenSSL/APR,
but not without TLS, and I cannot reproduce it on Tomcat 10.1.52 under the same
conditions. The issue appears to occur when the server fills the client’s TCP
receive window while writing a larger response.


----------


When upgrading an application from Tomcat 10.1.52 to Tomcat 11.0.18, I noticed
the following regression:

When using TLS... Occasionally, Tomcat 11 does not flush the entire servlet
response buffer after writing.  More specifically:

1. the client sends the request to tomcat
2. the server processes the request and attempts to send a response
3. the client receives most-but-not-all of the response body immediately
4. the client waits for the remaining portion of the response body
5. Eventually, Tomcat's configured keepAliveTimeout elapses
6. Tomcat flushes the remaining response
7. the client receives the remaining portion of the response body
8. Tomcat closes the connection.

I have reproduced this problem with Tomcat 11.0.0 and Tomcat 11.0.18, and a few
versions between.  I cannot reproduce this problem with Tomcat 10.1.52.

The full response body in this case is 109,886 bytes in cleartext (before being
encrypted for TLS).  The client receives the first 99,327 cleartext bytes (note
this is 1 byte fewer than 97KiB) in step 3, and the remaining bytes in step 7. 
 The encrypted content on the wire is larger, but I haven't measured exactly.

This pause only happens occasionally.  Not every time.  I can usually reproduce
it within 10 minutes when sending one request every second, to an isolated
server receiving no other traffic.

I can reproduce the problem with both JSSE and OpenSSL (via tomcat-native and
APR).
I cannot reproduce the problem without TLS.

------------

To reproduce the problem, I have a small Spring Boot WebFlux application that
uses embedded tomcat.

I originally noticed the problem when upgrading from
Spring Boot 3.5.11 (which uses Tomcat 10.1.52 by default) to
Spring Boot 4.0.3 (which uses Tomcat 11.0.18 by default).

To rule out any problems potentially introduced by Spring Boot 4, I tried
Spring Boot 3.5.11 with Tomcat 11, and was also able to reproduce the problem.
More specifically:
* Spring Boot 3.5.11 + Tomcat 10.1.52 => problem does not occur
* Spring Boot 3.5.11 + Tomcat 11.0.18 => problem occurs
* Spring Boot 4.0.3 + Tomcat 11.0.18  => problem occurs

Since the only difference between the first two configurations is the Tomcat
version used, I believe this is a regression introduced in Tomcat, not in
Spring.

The app and the packet captures mentioned below can be found at:

https://github.com/philsttr/tomcat_11_delayed_response


-----------

To debug this further, I have taken tcpdumps on the server running Tomcat for
both Tomcat 11 and Tomcat 10.1 to look for differences.  The packet captures
are available at: https://github.com/philsttr/tomcat_11_delayed_response

In the Tomcat 11 tcpdump (tomcat11.0.18.pcap), there are a bunch of successes,
followed by one failure. (I'm calling the slow responses "failures" in this
case).

Occasionally, I see "[TCP window specified by the receiver is now completely
full]" while the server is sending the response, indicating the server has
filled up the client's TCP receive window.
The client sends ACKs almost immediately after, but in the failure case, the
server doesn't send the remaining response even after the client ACKs.
Eventually, the server sends the remaining response after the keepAliveTimeout
elapses, as mentioned in step 5 above.

When I see the TCP window full event, sometimes the response is delayed, and
sometimes not. However, I have never seed a delayed response without the TCP
window full event. So the delayed response could be related to the TCP window
full event.

The failure case is toward the end of the tomcat11.0.18.pcap.  There is a 20s
pause (the configured keepAliveTimeout) between tcp frames 4263 and 4266.


In the Tomcat 10.1 tcpdump (tomcat10.1.52.pcap), I only see successes.   I do
see multiple instances of "[TCP window specified by the receiver is now
completely full]" while the server is sending the response.  However, Tomcat
10.1 always sends the remaining portion of the response immediately after the
client ACKs.  i.e. Tomcat 10.1 does not wait for the keepAliveTimeout to elapse
before sending the remaining response.


-----------

I've attempted additional debugging, but this problem seems somewhat "quantum"
in nature.  Any time I try to "observe" it with additional logging or
breakpoints in a debugger, the problem no longer occurs.  My guess is that
there is a race condition that is affected by the additional time of the
logging / debugging.  Perhaps it is preventing the client's TCP receive window
from filling (I haven't checked this).

I can, however, observe steps 5-8 in a debugger.

The flush from step 6 occurs in SecureNioChannel.close

https://github.com/apache/tomcat/blob/8c4df4522f2af66efa45844ebd33a3925db23e75/java/org/apache/tomcat/util/net/SecureNioChannel.java#L551

I can see that when SecureNioChannel.close is invoked, the netOutBuffer
contains the remaining portion of the response body that was not flushed
previously.


------------

To summarize, I can reproduce the problem only with this combination
* Tomcat 11  (specifically tried 11.0.0 and 11.0.18 and a few between)
* TLS (either JSSE or OpenSSL)
* somewhat large response payload.  In this case, 109,886 bytes in cleartext
* server fills the client's TCP receive window when sending the response
* no additional trace logging or breakpoints configured

I cannot reproduce the problem with any of the following:
* Tomcat 10.1.52
* non-TLS
* smaller response payloads
* server doesn't fill the client's TCP receive window
* additional trace logging or breakpoints configured

I'm running Tomcat on Ubuntu 24.04.4 LTS.
On the client side, I'm using curl 8.5.0 on Ubuntu 24.04.4 LTS.
I have not tested on other OSes or curl versions.

I'm running Eclipse Temurin 25.0.2+10.  I have not tried other Java versions.


--------------

This problem is extremely tricky.  I imagine it will be hard for you to
reproduce it. I'll help in any way I can.  I can try snapshots, or any changes
you want to test.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to