Hi Willy,

Thank you for the response. The behaviour in CouchDB is ancient (12+ years, 
essentially since before the 1.0 release), and yes, it is clearly a bit 
naughty, though it has also worked for us up to this point.

I raised this here because the change seemed inadvertent given the bisect, 
but I completely accept that this isn't something you should "fix" in 
HAProxy, as it is not broken behaviour. You are also right, of course, that 
many other things could cause the same issue; they just don't happen to in 
our particular setup.

We'll take this on board and think about next steps. One simple option is to 
disable compression on the responses to these requests, since we can easily 
identify which ones would be affected; a rough sketch of what we have in mind 
is below. All our other endpoints, many of which send chunked responses, make 
no assumptions about the timing of individual chunks (as long as they all 
arrive in the right order).
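
For the record, the shape of it would be to route the _changes requests to a 
backend section that has no compression directives at all. A rough, untested 
sketch (names, ports and MIME types are placeholders, and it assumes "mode 
http" is set in a defaults section):

    frontend couchdb_front
        bind :5984
        # send the _changes feed to the backend without compression
        acl is_changes path -m sub /_changes
        use_backend couchdb_nocomp if is_changes
        default_backend couchdb_comp

    backend couchdb_comp
        compression algo gzip
        compression type application/json text/plain
        server couch1 127.0.0.1:5986

    backend couchdb_nocomp
        # no "compression" lines here, so heartbeat chunks are forwarded as-is
        server couch1 127.0.0.1:5986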

Regards,
Robert Newson

On Fri, 23 Jun 2023, at 11:14, Willy Tarreau wrote:
> Hi Robert,
>
> On Fri, Jun 23, 2023 at 11:01:30AM +0100, Robert Newson wrote:
>> Hi,
>> 
>> We use HAProxy in front of Apache CouchDB. CouchDB has an endpoint called
>> _changes with some interesting characteristics. With certain commonly used
>> parameters, the response is effectively endless, streaming the updates to a
>> database as they happen using chunked transfer encoding. In periods where no
>> changes occur, a 'heartbeat' is periodically sent to keep the connection
>> alive. This heartbeat is small, consisting of a single newline character.
>> Each heartbeat is sent as a complete chunk.
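
(For reference, each such heartbeat chunk is only six bytes on the wire: the
chunk-size line "1" followed by CRLF, the single newline byte of payload, and
the closing CRLF:

    31 0d 0a 0a 0d 0a        i.e. "1" CR LF, LF, CR LF

so from the compressor's point of view each heartbeat presumably contributes
just one byte of input.)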
>> 
>> We found that HAProxy 2.6, if slz is compressing the response, will only pass
>> these chunks on to the client once three have arrived from the backend. The
>> delay in receiving these chunks at the client causes clients to time out. If
>> we build with USE_ZLIB=1 instead, the problem does not occur and the heartbeat
>> chunks are sent in a timely manner.
>
> This can depend on a lot of factors. In the past we used to see ZLIB
> buffer up to 32 or 256 kB (I don't remember which, to be honest) without
> delivering anything if the content was repetitive, for example. That was
> optimal, but slz is not capable of doing this.
>
>> We recently upgraded from 2.0.31 to 2.6.13, which introduced this unexpected
>> change. We eventually tracked it down to a change in libslz. With 2.0.31 we
>> were building libslz 1.0.0; with 2.6.13 we used the in-tree version. We
>> bisected HAProxy from 2.0 to 2.6 and found that the switch to in-tree slz was
>> the trigger point between good and bad behaviour. We then further bisected
>> libslz between v1.0.0 and b06c172 (the point at which it was merged in-tree).
>> This pointed at
>> http://git.1wt.eu/web?p=libslz.git;a=commit;h=b334e7ad57727cdb0738b135bb98b65763d758b5
>> as the exact moment the delay was introduced.
>
> Interesting, previously it would buffer at most 3 bytes and now at most
> 7. SLZ doesn't keep input context so you have guarantees that pushing
> more input will eventually produce some output. However it will of
> course be slower or faster depending on how large and how compressible
> the input contents are.
>
>> In summary, we think HAProxy should not indefinitely buffer a complete chunk
>> of a chunked transfer encoding just because it is below the level deemed
>> worthy of compression and should instead send it to the client in a timely
>> manner.
>
> But you're aware that what you're asking for is a direct violation of
> basic HTTP messaging rules, which state that no agent may depend on chunk
> delivery, since anything along the chain may have to buffer some of the
> data for analysis or transformation. I just looked it up; it's
> RFC9110#7.6:
>
>   https://www.rfc-editor.org/rfc/rfc9110
>
>   An HTTP message can be parsed as a stream for incremental processing or
>   forwarding downstream. However, senders and recipients cannot rely on
>   incremental delivery of partial messages, since some implementations will
>   buffer or delay message forwarding for the sake of network efficiency,
>   security checks, or content transformations.
>
> Note: this text was already in RFC7230 9 years ago so it's not something
> new.
>
>> This affects clients using HTTP/1.1 or HTTP/2, noting that Apache CouchDB
>> itself only supports HTTP/1.1.
>> 
>> We can work around this by disabling compression for this endpoint, though
>> this is not our preference.
>
> In fact you can face issues via many components, even without compression.
> For example, the TCP stack by default will refrain from sending incomplete
> frames for up to 200ms, meaning that even through a pure forwarder you may
> observe up to 200ms of extra delay (or maybe 400ms round-trip if you're
> implementing a sort of application-layer ping).
>
> Also I'm a bit puzzled, because you're essentially describing a mechanism
> which should consist in *not* compressing repetitive bytes for the sake
> of latency, which is exactly the opposite of compression. I even remember
> that we mentioned recently that we'd need to improve the compression by
> only compressing full buffers (something which we don't do for now), so
> it would mean that you'd have to send many such heartbeats in this case.
>
> Maybe we could find a tradeoff via an option that would forcefully flush
> output after compressing available data, but it would only work around
> this exact observation you've made and would be no guarantee that no
> other form of buffering would happen anywhere else in the chain. Thus
> I feel like we're trying to fix symptoms here, and this reminds me of
> all the discussions 15 years ago about HTTP not being suitable at all
> for interactive transfers that finally led to WebSocket being created
> for this purpose.
>
> Just a question: is this feature something very recent and new in this
> service, or has it been there for a while? Because if it's new, it might
> still be time to fix it to comply with HTTP. If it's legacy by now, it
> means it will participate in the ossification of HTTP and might well
> require a specific option to try to work around the design issue. It would
> not be the first time we have had to work around existing issues, but you
> will easily understand that we prefer to avoid starting down that route
> if there are other options; it's never good for the long term and it
> tends to fool users into using unreliable mechanisms that regularly
> break or cause trouble depending on the environment they're deployed in :-/
>
> Thanks,
> Willy
