Hi,

With the help of AI I have identified some small improvements for `haterm`. I fully understand if these patches are not accepted because they were created with AI assistance.

Here is what I observed when I started to benchmark `haterm` locally in a small HTTP lab, focusing on direct H1/H2/H3 behaviour with larger responses.

Based on these measurements, I split a small patch series that tries to reduce
response-path overhead in `haterm` without changing its overall role as a
lightweight test server.

I compared `haterm` before and after the patch series using simple local A/B
tests.

Two local images were built from the same source tree:

- baseline: unmodified `haterm`
- patched: `haterm` with the patch series applied

Both were built with the same AWS-LC / QUIC-capable build path.

For each run, I started one local `haterm` instance and drove it directly with
the same `h2load`-based client setup. The same ports, TLS material, SNI and
request shape were used on both sides.

The workload was intentionally simple:

- GET requests
- response size: 256 kB
- direct measurements against `haterm`
- protocols tested separately:
  - HTTP/1.1
  - HTTP/2
  - HTTP/3

For each protocol, I compared:

- requests per second
- request latency
- coarse container CPU samples
- coarse container memory samples

The H2/H3 tests used the same TLS/SNI/ALPN settings in both cases.

Observed A/B result
===================

HTTP/1.1
--------

In a repeated local smoke run with 8000 requests:

- baseline: 17499.88 req/s
- patched:  21390.95 req/s

That is roughly a +22% throughput improvement.

Mean request latency also moved slightly down:

- baseline: 15.11 ms
- patched:  14.35 ms

HTTP/2
------

In a local smoke run with 4000 requests:

- baseline: 13745.14 req/s
- patched:  14191.24 req/s

That is roughly a +3.2% throughput improvement.

Mean request latency moved slightly down:

- baseline: 17.89 ms
- patched:  17.50 ms

Coarse container samples during that run were approximately:

- CPU: 29.85% -> 28.89%
- memory: 91.54 MiB -> 92.09 MiB

HTTP/3
------

In a local smoke run with 4000 requests:

- baseline: 8934.47 req/s
- patched:  9221.99 req/s

That is roughly a +3.2% throughput improvement.

Mean request latency moved slightly down:

- baseline: 47.51 ms
- patched:  46.13 ms

Coarse container samples during that run were approximately:

- CPU: 84.70% -> 82.08%
- memory: 132.50 MiB -> 130.20 MiB

I do not want to overstate the exact percentages because these were local
smoke-style A/B tests, not long benchmark campaigns.

Still, the direction was consistent enough to justify the series:

- the H1 gain was clear
- H2 and H3 improved modestly
- H3 CPU/memory also moved slightly in the right direction

The measurements are consistent with reduced response-path overhead from:

- removing `snprintf()` from `hstream_build_http_resp()`
- reporting the `/?t=` wait time in the generated headers
- increasing the prebuilt response buffer size
- batching payload filling so larger responses need fewer refill cycles


Local smoke-test commands
=========================

The local A/B smoke tests used the following commands.

Start haterm
------------

```bash
podman run -d --rm --name haterm-smoke-new --network host \
  -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
  localhost/bench-hap-own-local:latest /usr/local/sbin/haterm \
    -L "127.0.0.1:18089" \
    -F "bind [email protected]:18452 ssl crt /mnt/combined.pem alpn h3" \
    -F "bind 127.0.0.1:18452 ssl crt /mnt/combined.pem alpn h2"
```

For the baseline run, only the image name changed:

`localhost/haterm:latest`

HTTP/1.1 smoke test
-------------------

```bash
podman run --rm --network host localhost/h2load:latest \
  --h1 -n 4000 -c 50 -t 4 -m 10 \
  "http://127.0.0.1:18089/?s=256k"
```

HTTP/2 smoke test
-----------------

```bash
podman run --rm --network host \
  -e SSL_CERT_FILE=/mnt/ca.crt \
  -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
  localhost/h2load:latest \
  --connect-to=127.0.0.1:18452 \
  --sni=bench.local \
  --alpn-list=h2 \
  -n 4000 -c 50 -t 4 -m 10 \
  "https://bench.local:18452/?s=256k"
```

HTTP/3 smoke test
-----------------

```bash
podman run --rm --network host \
  -e SSL_CERT_FILE=/mnt/ca.crt \
  -v /datadisk/git-repos/server-benchmark/tls:/mnt:ro \
  localhost/h2load:latest \
  --connect-to=127.0.0.1:18452 \
  --sni=bench.local \
  --alpn-list=h3 \
  -n 4000 -c 50 -t 4 -m 10 \
  "https://bench.local:18452/?s=256k"
```

Repeated HTTP/1.1 spot check
----------------------------

```bash
podman run --rm --network host localhost/h2load:latest \
  --h1 -n 8000 -c 50 -t 4 -m 10 \
  "http://127.0.0.1:18089/?s=256k"
```


Notes
=====
- The payload size was always `/?s=256k`.
- H2/H3 used the same local CA and the same SNI (`bench.local`) in both
  baseline and patched runs.
- The same ports and TLS material were used in all A/B comparisons.


Patches
=======

The series is split into small steps:

1. use chunk builders for generated response headers
2. report the requested wait time in generated headers
3. increase the size of prebuilt response buffers
4. add a helper to fill HTX data in batches
5. switch the response path to the batched fill helper

Comments welcome, especially on whether this looks like a reasonable direction
for `haterm`.

Best regards

Aleks
From 97db1032ed2bad2de353966f8eff0be97b5cfd42 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Sun, 15 Mar 2026 14:48:56 +0100
Subject: [PATCH 5/5] OPTIM/MINOR: haterm: use the batched HTX fill helper for
 response payloads

Now that `hstream_add_data_batch()` is available, use it at the two
response payload fill sites in `hstream_build_http_resp()` and
`process_hstream()`.

This lets haterm push more body data per wakeup when the HTX buffer has
room for it, reducing the number of refill cycles for larger responses.

Signed-off-by: Aleksandar Lazic <[email protected]>
---
 src/haterm.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/haterm.c b/src/haterm.c
index c21db079ca..d7c8a7092d 100644
--- a/src/haterm.c
+++ b/src/haterm.c
@@ -633,8 +633,7 @@ static int hstream_build_http_resp(struct hstream *hs)
 		goto err;
 	}
 
-	if (hs->to_write > 0)
-		hstream_add_data(htx, hs);
+	hstream_add_data_batch(htx, hs);
 	if (hs->to_write <= 0)
 		htx->flags |= HTX_FL_EOM;
 	htx_to_buf(htx, buf);
@@ -915,8 +914,7 @@ static struct task *process_hstream(struct task *t, void *context, unsigned int
 		}
 
 		htx = htx_from_buf(buf);
-		if (hs->to_write > 0)
-			hstream_add_data(htx, hs);
+		hstream_add_data_batch(htx, hs);
 		if (hs->to_write <= 0)
 			htx->flags |= HTX_FL_EOM;
 		htx_to_buf(htx, &hs->res);
-- 
2.43.0

From 60094d358a36ab37f8159caef84de30ea280f957 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Sun, 15 Mar 2026 14:47:33 +0100
Subject: [PATCH 4/5] MINOR: haterm: add a helper to fill HTX data in batches

`hstream_add_data()` appends a single data fragment to the HTX response
buffer. For large responses, callers may benefit from filling the
available HTX data space in one go instead of re-entering the response
path repeatedly.

Add `hstream_add_data_batch()` as a small helper that repeatedly calls
`hstream_add_data()` while space remains in the HTX buffer and data is
still pending.

No call site is changed yet.

Signed-off-by: Aleksandar Lazic <[email protected]>
---
 src/haterm.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/haterm.c b/src/haterm.c
index 6e201e923c..c21db079ca 100644
--- a/src/haterm.c
+++ b/src/haterm.c
@@ -491,6 +491,22 @@ static void hstream_add_data(struct htx *htx, struct hstream *hs)
 	return;
 }
 
+/* Fill the HTX buffer with as much payload data as possible in one wakeup.
+ * This reduces the number of send / task wakeup cycles, which is especially
+ * expensive on the H3/QUIC path.
+ */
+static void hstream_add_data_batch(struct htx *htx, struct hstream *hs)
+{
+	unsigned long long before;
+
+	while (hs->to_write > 0 && htx_free_data_space(htx) > 0) {
+		before = hs->to_write;
+		hstream_add_data(htx, hs);
+		if (hs->to_write == before)
+			break;
+	}
+}
+
 /* Build the HTTP response with eventually some BODY data depending on ->to_write
  * value. Return 1 if succeeded, 0 if not.
  */
-- 
2.43.0

From 5a45d60cb22928c8228d6b893eef6d06311a332b Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Sun, 15 Mar 2026 14:46:08 +0100
Subject: [PATCH 3/5] OPTIM/MINOR: haterm: increase the size of prebuilt
 response buffers

The prebuilt response buffers are currently limited to 16 kB. Large test
responses therefore require many more refill cycles than necessary.

Increase `RESPSIZE` to 128 kB so that larger responses can be copied out
in fewer chunks.

This does not change the generated content. It only changes the internal
buffer size used to serve it.

Signed-off-by: Aleksandar Lazic <[email protected]>
---
 src/haterm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haterm.c b/src/haterm.c
index afe7be9af1..6e201e923c 100644
--- a/src/haterm.c
+++ b/src/haterm.c
@@ -53,7 +53,7 @@ const char *HTTP_HELP =
         "\n";
 
 /* Size in bytes of the prebuilts response buffers */
-#define RESPSIZE 16384
+#define RESPSIZE 131072
 /* Number of bytes by body response line */
 #define HS_COMMON_RESPONSE_LINE_SZ 50
 static char common_response[RESPSIZE];
-- 
2.43.0

From d05459f6e19d1562c8c0551c648e3820e7044e07 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Sun, 15 Mar 2026 14:41:11 +0100
Subject: [PATCH 2/5] MINOR: haterm: report the requested wait time in
 generated headers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The `/?t=` URI parameter already controls the time spent waiting before a
response is sent, but the generated `X-req` and `X-rsp` headers still
report a fixed `time=0 ms`.

Use the parsed wait value from `hs->res_wait` and expose it in both
headers so that the generated metadata matches the configured response
delay. If the `/?t=` URI parameter is not set, the default value 0 is
reported, as before.

Signed-off-by: Aleksandar Lazic <[email protected]>
---
 src/haterm.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/haterm.c b/src/haterm.c
index 58cad33268..afe7be9af1 100644
--- a/src/haterm.c
+++ b/src/haterm.c
@@ -502,6 +502,7 @@ static int hstream_build_http_resp(struct hstream *hs)
 	unsigned int flags = HTX_SL_F_IS_RESP | HTX_SL_F_XFER_LEN | (!hs->req_chunked ?  HTX_SL_F_CLEN : 0);
 	struct htx_sl *sl;
 	char *end;
+	int wait_ms;
 
 	TRACE_ENTER(HS_EV_HSTRM_RESP, hs);
 
@@ -552,8 +553,17 @@ static int hstream_build_http_resp(struct hstream *hs)
 		goto err;
 	}
 
+	wait_ms = hs->res_wait == TICK_ETERNITY ? 0 : TICKS_TO_MS(hs->res_wait);
 	chunk_reset(&trash);
-	if (!chunk_strcat(&trash, "time=0 ms") ||
+	if (!chunk_strcat(&trash, "time=")) {
+		TRACE_ERROR("could not build x-req HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	end = ultoa_o(wait_ms, trash.area + trash.data, trash.size - trash.data);
+	if (!end)
+		goto err;
+	trash.data = end - trash.area;
+	if (!chunk_strcat(&trash, " ms") ||
 	    !htx_add_header(htx, ist("X-req"), ist2(trash.area, trash.data))) {
 		TRACE_ERROR("could not add x-req HTX header", HS_EV_HSTRM_RESP, hs);
 	    goto err;
@@ -588,7 +598,15 @@ static int hstream_build_http_resp(struct hstream *hs)
 	if (!end)
 		goto err;
 	trash.data = end - trash.area;
-	if (!chunk_strcat(&trash, ", time=0 ms (0 real)") ||
+	if (!chunk_strcat(&trash, ", time=")) {
+		TRACE_ERROR("could not build x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	end = ultoa_o(wait_ms, trash.area + trash.data, trash.size - trash.data);
+	if (!end)
+		goto err;
+	trash.data = end - trash.area;
+	if (!chunk_strcat(&trash, " ms (0 real)") ||
 	    !htx_add_header(htx, ist("X-rsp"), ist2(trash.area, trash.data))) {
 		TRACE_ERROR("could not add x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
 	    goto err;
-- 
2.43.0

From afec5218212213790ac1877396d9b02fa2ab656a Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Sun, 15 Mar 2026 14:37:57 +0100
Subject: [PATCH 1/5] OPTIM/MINOR: haterm: use chunk builders for generated
 response headers

hstream_build_http_resp() currently uses snprintf() to build the
status code and the generated X-req/X-rsp header values.

These strings are short and are fully derived from already parsed request
state, so they can be assembled directly in the HAProxy trash buffer using
`chunk_strcat()` and `ultoa_o()`.

This keeps the generated output unchanged while removing the remaining
`snprintf()` calls from the response-building path.

No functional change is expected.

Signed-off-by: Aleksandar Lazic <[email protected]>
---
 src/haterm.c | 52 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/src/haterm.c b/src/haterm.c
index 98100f83ed..58cad33268 100644
--- a/src/haterm.c
+++ b/src/haterm.c
@@ -501,11 +501,14 @@ static int hstream_build_http_resp(struct hstream *hs)
 	struct htx *htx;
 	unsigned int flags = HTX_SL_F_IS_RESP | HTX_SL_F_XFER_LEN | (!hs->req_chunked ?  HTX_SL_F_CLEN : 0);
 	struct htx_sl *sl;
-	char hdrbuf[128];
+	char *end;
 
 	TRACE_ENTER(HS_EV_HSTRM_RESP, hs);
 
-	snprintf(hdrbuf, sizeof(hdrbuf), "%d", hs->req_code);
+	chunk_reset(&trash);
+	end = ultoa_o(hs->req_code, trash.area, trash.size);
+	if (!end)
+		goto err;
 	buf = hstream_get_buf(hs, &hs->res);
 	if (!buf) {
 		TRACE_ERROR("could not allocate response buffer", HS_EV_HSTRM_RESP, hs);
@@ -515,7 +518,7 @@ static int hstream_build_http_resp(struct hstream *hs)
 	htx = htx_from_buf(buf);
 	sl = htx_add_stline(htx, HTX_BLK_RES_SL, flags,
 	                    !(hs->ka & 4) ? ist("HTTP/1.0") : ist("HTTP/1.1"),
-	                    ist(hdrbuf), IST_NULL);
+	                    ist2(trash.area, end - trash.area), IST_NULL);
 	if (!sl) {
 		TRACE_ERROR("could not add HTX start line", HS_EV_HSTRM_RESP, hs);
 		goto err;
@@ -549,19 +552,44 @@ static int hstream_build_http_resp(struct hstream *hs)
 		goto err;
 	}
 
-	/* XXX TODO time?  XXX */
-	snprintf(hdrbuf, sizeof(hdrbuf), "time=%ld ms", 0L);
-	if (!htx_add_header(htx, ist("X-req"), ist(hdrbuf))) {
+	chunk_reset(&trash);
+	if (!chunk_strcat(&trash, "time=0 ms") ||
+	    !htx_add_header(htx, ist("X-req"), ist2(trash.area, trash.data))) {
 		TRACE_ERROR("could not add x-req HTX header", HS_EV_HSTRM_RESP, hs);
 	    goto err;
 	}
 
-	/* XXX TODO time? XXX */
-	snprintf(hdrbuf, sizeof(hdrbuf), "id=%s, code=%d, cache=%d,%s size=%lld, time=%d ms (%ld real)",
-	         "dummy", hs->req_code, hs->req_cache,
-			 hs->req_chunked ? " chunked," : "",
-			 hs->req_size, 0, 0L);
-	if (!htx_add_header(htx, ist("X-rsp"), ist(hdrbuf))) {
+	chunk_reset(&trash);
+	if (!chunk_strcat(&trash, "id=dummy, code=")) {
+		TRACE_ERROR("could not build x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	end = ultoa_o(hs->req_code, trash.area + trash.data, trash.size - trash.data);
+	if (!end)
+		goto err;
+	trash.data = end - trash.area;
+	if (!chunk_strcat(&trash, ", cache=")) {
+		TRACE_ERROR("could not build x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	end = ultoa_o(hs->req_cache, trash.area + trash.data, trash.size - trash.data);
+	if (!end)
+		goto err;
+	trash.data = end - trash.area;
+	if (hs->req_chunked && !chunk_strcat(&trash, ", chunked,")) {
+		TRACE_ERROR("could not build x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	if (!chunk_strcat(&trash, " size=")) {
+		TRACE_ERROR("could not build x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
+	    goto err;
+	}
+	end = ultoa_o(hs->req_size, trash.area + trash.data, trash.size - trash.data);
+	if (!end)
+		goto err;
+	trash.data = end - trash.area;
+	if (!chunk_strcat(&trash, ", time=0 ms (0 real)") ||
+	    !htx_add_header(htx, ist("X-rsp"), ist2(trash.area, trash.data))) {
 		TRACE_ERROR("could not add x-rsp HTX header", HS_EV_HSTRM_RESP, hs);
 	    goto err;
 	}
-- 
2.43.0
