On 2019-02-19 06:47, Willy Tarreau wrote:

This is interesting. As you observed in the trace you sent me, the
lighttpd server closes just after sending the response headers. This
indeed matches the "SD" log that aproxy emits. If it doesn't happen
in TCP mode nor with Nginx, it means that something haproxy modifies
in the request causes this effect on the server.

Hi

I'm sending answer from colleague who investigated this more thoroughly, especially from lighttpd side:

we've been debugging this a bit further and it does not look like the issue with the seemingly random incomplete HTTP responses would be due to any particular request headers at the HTTP layer. It rather looks like something at the TCP level (so specific to HTTP mode):

A first observation we made is that the frequency of these incomplete transfers increases when we add a delay at the backend server after sending the response headers and before sending the body data. We added a 100 ms delay there and then got a lot of interrupted transfers that had only received the response headers (= no delay) but 0 bytes of the body (= which was sent just after delay). So the frequency with which this happens appears to be proportional to latencies/stalls in the backend server sending the response data (some read timeout logic at haproxy??).

We debugged further and noticed that in all cases where transfers were incomplete our lighttpd backend server was receiving an EPOLLRDHUP event on the socket where it communicates with haproxy. So it appears as if haproxy is *sometimes* (apparently depending on some read latency/stall - see above) shutting down its socket with the backend for writing *before* the full response and body data has been received.

And this is also basically ok because the socket remains writeable for lighttpd and so it could still send down the rest of the response data. However, it looks like lighttpd is not expecting this kind of behavior from the client and is not correctly handling such a half-closed TCP session. There is code in lighttpd to handle such a EPOLLRDHUP event and half-closed TCP connection, but lighttpd then also checks the state of the TCP session with getsockopts and keeps the connection open *only* when the state is TCP_CLOSE_WAIT. In all other cases upon receiving the EPOLLRDHUP it actively changes the state of the connection to "ERROR" and then closes the connection:

https://github.com/lighttpd/lighttpd1.4/blob/master/src/connections.c#L908
https://github.com/lighttpd/lighttpd1.4/blob/master/src/fdevent.c#L995

We checked and every time we have a incomplete response lighttpd receives the EPOLLRDHUP event on the socket but the tcp state queried via getsockopts is always TCP_CLOSE (and not TCP_CLOSE_WAIT as lighttpd seems to expect). And because of this lighttpd then actively closes the half-closed connection also from its end (which likely is the cause of the TCP FIN sent by lighttpd as seen in the tcpdump).

When we remove this condition from lighttpd which marks the connection as errorness in case of EPOLLRDHUP and tcp state != TCP_CLOSE_WAIT, then the problem with the incomplete transfers disappears:

https://github.com/lighttpd/lighttpd1.4/blob/master/src/connections.c#L922

We do not understand why this is or what the correct reaction to the EPOLLRDHUP event should be. In particular, we do not understand why lighttpd performs this check for TCP_CLOSE_WAIT or why we always get a state of TCP_CLOSE when we receive this event but the socket still continues to be writeable (so does the TCP_CLOSE just indicate that one direction of the connection is closed??). Still, because this half-closing of the connection to the backed server appears to happen just pretty randomly and depending on latency/stalls of the backend server sending down the response data, we assume that this is not the intended behavior by haproxy (and so possibly indicates some bug in haproxy too).

We assume that the reason why direct requests to the backend server or requests proxied via Nginx did never fail is because in these cases there never occurs the EPOLLRDHUP event and there never are half-closed connections. However, we have not tested this (yet), so we did not re-test with Nginx to verify that then indeed lighttpd never sees a EPOLLRDHUP.

Any ideas or suggestions based on these findings what should be the proper solution to the problem?

Thank you.

Reply via email to