Hi Aleks,

On Sun, Mar 15, 2026 at 03:12:47PM +0100, Aleksandar Lazic wrote:
> Hi,
> 
> With the help of AI, I found some small improvements for haterm. I fully
> understand if these patches are not accepted because they were created
> with AI assistance.
> 
> Here is what I observed when I started to benchmark `haterm` locally and
> in a small HTTP lab, focusing on direct H1/H2/H3 behaviour with larger
> responses.
> 
> Based on these measurements, I put together a small patch series that
> tries to reduce response-path overhead in `haterm` without changing its
> overall role as a lightweight test server.
> 
> I compared `haterm` before and after the patch series using simple local A/B
> tests.
> 
> Two local images were built from the same source tree:
> 
> - baseline: unmodified `haterm`
> - patched: `haterm` with the patch series applied
> 
> Both were built with the same AWS-LC / QUIC-capable build path.
> 
> For each run, I started one local `haterm` instance and drove it directly with
> the same `h2load`-based client setup. The same ports, TLS material, SNI and
> request shape were used on both sides.
> 
> The workload was intentionally simple:
> 
> - GET requests
> - response size: 256 kB
> - direct measurements against `haterm`
> - protocols tested separately:
>   - HTTP/1.1
>   - HTTP/2
>   - HTTP/3
> 
> For each protocol, I compared:
> 
> - requests per second
> - request latency
> - coarse container CPU samples
> - coarse container memory samples
> 
> The H2/H3 tests used the same TLS/SNI/ALPN settings in both cases.
> 
> Observed A/B result
> ===================
> 
> HTTP/1.1
> 
> In a repeated local smoke run with 8000 requests:
> 
> - baseline: 17499.88 req/s
> - patched:  21390.95 req/s
> 
> That is roughly a +22% throughput improvement.
> 
> Mean request latency also moved slightly down:
> 
> - baseline: 15.11 ms
> - patched:  14.35 ms
> 
> HTTP/2
> 
> In a local smoke run with 4000 requests:
> 
> - baseline: 13745.14 req/s
> - patched:  14191.24 req/s
> 
> That is roughly a +3.2% throughput improvement.
> 
> Mean request latency moved slightly down:
> 
> - baseline: 17.89 ms
> - patched:  17.50 ms
> 
> Coarse container samples during that run were approximately:
> 
> - CPU: 29.85% -> 28.89%
> - memory: 91.54 MiB -> 92.09 MiB
> 
> HTTP/3
> 
> In a local smoke run with 4000 requests:
> 
> - baseline: 8934.47 req/s
> - patched:  9221.99 req/s
> 
> That is roughly a +3.2% throughput improvement.
> 
> Mean request latency moved slightly down:
> 
> - baseline: 47.51 ms
> - patched:  46.13 ms
> 
> Coarse container samples during that run were approximately:
> 
> - CPU: 84.70% -> 82.08%
> - memory: 132.50 MiB -> 130.20 MiB
> 
> I do not want to overstate the exact percentages because these were local
> smoke-style A/B tests, not long benchmark campaigns.
> 
> Still, the direction was consistent enough to justify the series:
> 
> - the H1 gain was clear
> - H2 and H3 improved modestly
> - H3 CPU/memory also moved slightly in the right direction
> 
> The measurements are consistent with reduced response-path overhead from:
> 
> - removing `snprintf()` from `hstream_build_http_resp()`
> - reporting `/?t=` in the generated headers
> - increasing the prebuilt response buffer size
> - batching payload filling so larger responses need fewer refill cycles
> 
> 
> Local smoke-test commands
> =========================
> 
> The local A/B smoke tests used the following commands.
> 
> Start haterm
> ------------
> 
> ```bash
> podman run -d --rm --name haterm-smoke-new --network host \
>   -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
>   localhost/bench-hap-own-local:latest /usr/local/sbin/haterm \
>     -L "127.0.0.1:18089" \
>     -F "bind [email protected]:18452 ssl crt /mnt/combined.pem alpn h3" \
>     -F "bind 127.0.0.1:18452 ssl crt /mnt/combined.pem alpn h2"
> ```
> 
> For the baseline run, only the image name changed:
> 
> localhost/haterm:latest
> 
> HTTP/1.1 smoke test
> ===================
> 
> ```bash
> podman run --rm --network host localhost/h2load:latest \
>   --h1 -n 4000 -c 50 -t 4 -m 10 \
>   "http://127.0.0.1:18089/?s=256k";
> ```
> 
> HTTP/2 smoke test
> =================
> 
> ```bash
> podman run --rm --network host \
>   -e SSL_CERT_FILE=/mnt/ca.crt \
>   -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
>   localhost/h2load:latest \
>   --connect-to=127.0.0.1:18452 \
>   --sni=bench.local \
>   --alpn-list=h2 \
>   -n 4000 -c 50 -t 4 -m 10 \
>   "https://bench.local:18452/?s=256k";
> ```
> 
> HTTP/3 smoke test
> =================
> 
> ```bash
> podman run --rm --network host \
>   -e SSL_CERT_FILE=/mnt/ca.crt \
>   -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
>   localhost/h2load:latest \
>   --connect-to=127.0.0.1:18452 \
>   --sni=bench.local \
>   --alpn-list=h3 \
>   -n 4000 -c 50 -t 4 -m 10 \
>   "https://bench.local:18452/?s=256k";
> ```
> 
> Repeated HTTP/1.1 spot check
> ============================
> 
> ```bash
> podman run --rm --network host localhost/h2load:latest \
>   --h1 -n 8000 -c 50 -t 4 -m 10 \
>   "http://127.0.0.1:18089/?s=256k";
> ```
> 
> 
> Notes
> =====
> - The payload size was always `/?s=256k`.
> - H2/H3 used the same local CA and the same SNI (bench.local) in both
>   baseline and patched runs.
> - The same ports and TLS material were used in all A/B comparisons.
> 
> 
> Patches
> =======
> 
> The series is split into small steps:
> 
> 1. use chunk builders for generated response headers
> 2. report the requested wait time in generated headers
> 3. increase the size of prebuilt response buffers
> 4. add a helper to fill HTX data in batches
> 5. switch the response path to the batched fill helper
> 
> Comments welcome, especially on whether this looks like a reasonable direction
> for `haterm`.

Thanks for your work and your measurements.

This morning I had a look at your patch series and gave it a try on
our local lab (ARM and AMD). I'm seeing mixed results.

A few things in random order:
  - it's great that you got rid of that nasty snprintf(), I did the same
    on httpterm last year and gained a double-digit percentage in request
    rate. However this will not be measurable with 256k responses since
    the overhead of such a call compared to sending 256k is negligible.
    But that was on my radar as something to get rid of, so I'm grateful
    that you did it.

  - the time measurement is not correct actually, it reports the requested
    time while the purpose was to indicate the generation time. It's useful
    when you don't know if you're measuring haterm's internal latency or
    network latency. I've used this a lot with httpterm in the past, where
    latencies of several milliseconds could happen on a saturated machine,
    and seeing the server denounce itself as the culprit was definitely
    helpful!

  - for the change on the RESPSIZE from 16kB to 128kB, I'm observing
    different results:
      - on the AMD, it's worse by a few percent (~2%). My guess is
        that it causes more L3 cache thrashing and that since this
        machine has a limited memory bandwidth (~35 GB/s), the larger
        worksize has a negative impact.

      - on the ARM, it's slightly better by ~2%. This machine has
        130 GB/s of memory bandwidth, which can easily amortize the
        extra RAM accesses and benefit from the slightly reduced
        scheduling.

      - on both machines, reducing the response size to 32kB and using
        tune.bufsize 65536 gives a huge boost (and only this combination).
        On the AMD, it's jumping from 167 to 269 Gbps (+61%). On the
        ARM, it's jumping from 397 to 605 Gbps (+52%). Note, this was on
        H1, which for now remains the only one we can reliably monitor.
        Even SSL benefits from this, even though less due to crypto.

  - the last patch creating the loop to try to better fill the target
    buffer should theoretically not change anything, yet it does. On
    the AMD it degrades the performance by an extra 2-3%, while on the
    ARM it brings roughly 3%.

All this makes me think that we're facing a scheduling issue: there's
apparently one combination (respsize 32k + bufsize 64k) which gives the
best performance, most likely because it's the largest chunk that can
be copied at once in L1 and allows all copies to remain cache-line
aligned, but that's speculation. The data are then not too large to
leave in one go while preserving TSO capabilities. Also, the fact that
the gain is so high on both architectures is not a coincidence, it's
not something due just to the CPU architecture but to the software
architecture. Also, I'm wondering why there's a change when you loop
over the HTX since in my opinion it ought to fill what it can at once,
and this alone deserves investigation and might help respond to the
first point.
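
For completeness, the buffer half of that combination is just the usual
global tunable (assuming haterm honors it the same way haproxy does; the
32kB response size is a build-time constant here and only noted in a
comment):

```
global
    tune.bufsize 65536  # 64kB buffers, paired with a 32kB response chunk
```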

If you want, I can already merge your first patch (snprintf) as it's
definitely useful.

Thank you!
Willy

