Hi Krzysztof,

On Fri, Aug 29, 2025 at 10:21:07AM +0200, Krzysztof Kozlowski wrote:
> We observed high CPU utilization on several CPU cores on our HAProxy
> instances running version 3.2.4 (canary deployment) in production.
> 
> Following a recent L7 DDoS attack, we noticed high CPU utilization on several
> CPU cores on these machines. For comparison, other machines in the cluster
> running HAProxy version 3.0.10 were not impacted by this issue. High CPU
> utilization was visible even after we stopped sending traffic to that canary
> machine. 

Hmmm that's definitely not good!

> HAProxy canary version: 
> # haproxy -vv
> HAProxy version 3.2.4-98813a1 2025/08/13 - https://haproxy.org/
(...)
> Our  3.2.4 version also contains a patch: 
> https://git.haproxy.org/?p=haproxy.git;a=patch;h=2c7e05f80e3b4bacad84eb213b9db3bc5d8963dc.

OK thanks for clarifying, I can confirm that this one is not relevant.

> Not sure if it is important, but we assigned dedicated cores to haproxy with
> these directives:
> nbthread 24
> cpu-map auto:1/1-24 4-27

Useful as well but not involved here.

> Other machines in the cluster which were not impacted by the high CPU
> utilization issue were running version 3.0.10:
> # haproxy -vv
> HAProxy version 3.0.10-346eb4f 2025/04/22 - https://haproxy.org/
(...)
> After spotting the high CPU utilization we stopped sending traffic to the
> canary machine (took it out of the cluster), and then we ran "show threads":
> # echo "show threads" | socat /var/run/haproxy/haproxy1.sock stdio
(...)

Many thanks for providing these outputs; they are extremely useful!
See below:

>   Thread 6 : id=0x7f08f32e0640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>       1/6    stuck=0 prof=0 harmless=0 isolated=0
>              cpu_ns: poll=4546197562468 now=4546197574309 diff=11841
>              curr_task=0x7f08e5300970 (tasklet) calls=637570029
                                                        ^^^^^^^^^
>                fct=0x55fbae711e00(h2_io_cb) ctx=0x7f08e6edfac0
                                    ^^^^^^^^

This function is the one which handles the I/O events for an HTTP/2
connection. The fact that it was called more than half a billion times
definitely explains why you're seeing such huge CPU usage.
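
For the record, an easy way to see whether such a counter is still
climbing on an idle instance is to take two "show threads" snapshots a
few seconds apart and compare them. A quick sketch, reusing the socket
path from your output (the /tmp file names and the 10-second interval
are just examples):

    # take two "show threads" snapshots ~10 seconds apart and compare the calls= counters
    echo "show threads" | socat /var/run/haproxy/haproxy1.sock stdio | grep 'calls=' > /tmp/thr1
    sleep 10
    echo "show threads" | socat /var/run/haproxy/haproxy1.sock stdio | grep 'calls=' > /tmp/thr2
    diff /tmp/thr1 /tmp/thr2

If the numbers still grow by millions with no traffic at all, the
wakeup loop is entirely self-sustained.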

>              lock_hist: W:TASK_WQ S:TASK_WQ U:TASK_WQ W:STK_TABLE U:STK_TABLE W:TASK_WQ R:TASK_WQ U:TASK_WQ
>              call trace(16):
>              | 0x55fbae834118 <00 00 00 e8 c8 ca dd ff]: ha_dump_backtrace+0x88/0x3e5 > main-0x2910
>              | 0x55fbae837477 <00 31 f6 e8 49 f6 ff ff]: debug_handler+0xa7/0xca > ha_thread_dump_one
>              | 0x7f098e718520 <00 00 00 00 0f 1f 40 00]: libc:+0x42520
>              | 0x7f098eb0f2c1 <48 c7 43 f8 00 00 00 00]: libcrypto:ERR_clear_error+0x41/0xaa
>              | 0x7f098eee6dac <00 00 00 e8 24 a5 fc ff]: libssl:+0x4bdac > libssl:+0x162d0
>              | 0x7f098eee717f <f4 89 d5 e8 01 fc ff ff]: libssl:SSL_peek+0x1f/0x95 > libssl:+0x4bd80
>              | 0x7f098eee7211 <83 ec 10 e8 9f a6 fc ff]: libssl:SSL_read+0x11/0x6f > libssl:+0x168b0
>              | 0x55fbae622615 <03 76 08 e8 4b f4 fe ff]: main+0xf125 > main-0x1a90
>              | 0x55fbae6feeaf <4c 89 f2 4c 89 e7 ff 10]: hlua_process_task+0x21e2f
>              | 0x55fbae712184 <48 89 df e8 1c c9 fe ff]: h2_io_cb+0x384/0x637 > hlua_process_task+0x21a20

Here I'm suspecting that an error is reported by the SSL stack and
that for some reason it is either dropped or ignored (e.g. if we
cannot make progress in the processing).

>   Thread 10: id=0x7f08f12dc640 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
>       1/10   stuck=0 prof=0 harmless=0 isolated=0
>              cpu_ns: poll=4455933245738 now=4455933256347 diff=10609
>              curr_task=0x7f098a48d4b0 (tasklet) calls=317426162

>   Thread 13: id=0x7f08efad9640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>       1/13   stuck=0 prof=0 harmless=0 isolated=0
>              cpu_ns: poll=4744029449011 now=4744029457164 diff=8153
>              curr_task=0x7f08b4975c60 (tasklet) calls=620312397

>   Thread 15: id=0x7f08eead7640 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
>       1/15   stuck=0 prof=0 harmless=0 isolated=0
>              cpu_ns: poll=4369482934362 now=4369482942272 diff=7910
>              curr_task=0x7f08d7ba8940 (tasklet) calls=621762017

> and then we enabled profiling:
> # echo "set profiling tasks on" | socat /run/haproxy/haproxy1.sock stdio
> # echo "show profiling" | socat /run/haproxy/haproxy1.sock stdio
> Per-task CPU profiling              : on            # set profiling tasks {on|auto|off}
> Memory usage profiling              : off           # set profiling memory {on|off}
> Tasks activity over 42.067 sec till 0.000 sec ago:
>   function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
>   ssl_sock_io_cb            143271204   57.69s    402.0ns   56.17s    392.0ns <- conn_subscribe@src/connection.c:959 tasklet_wakeup
>   h2_io_cb                  143271202   56.15s    391.0ns   53.70s    374.0ns <- ssl_sock_io_cb@src/ssl_sock.c:5812 tasklet_wakeup

This is extremely useful as well; we can clearly see that something is
wrong here: that's roughly 3.4 million wakeups per second for each of
these two callbacks, and together they burn about 114 seconds of CPU
over a 42-second window, i.e. more than two cores doing nothing but
spinning. We're at this location (ssl_sock_io_cb, src/ssl_sock.c:5812):

        if ((ctx->conn->flags & CO_FL_ERROR) ||
            !(ctx->conn->flags & CO_FL_SSL_WAIT_HS)) {
                int woke = 0;

                /* On error, wake any waiter */
                if (ctx->subs) {
-->                     tasklet_wakeup(ctx->subs->tasklet);
                        ctx->subs->events = 0;
                        woke = 1;
                        ctx->subs = NULL;
                }

So basically SSL is reporting an error, waking up the H2 layer, which
consults the transport layer to retrieve data or get updates, does
nothing with it, and the cycle repeats, hence the endless wakeups.
It's unclear to me what we've changed since 3.0 that could explain
this, but that's a serious problem we need to figure out and fix.

If the issue happens again, a "show sess all" would be super helpful
to understand the internal states that are causing this. However, it
can disclose some information you don't want to share publicly (e.g.
IP addresses, internal server names, etc.), so if you can get one,
feel free to send it privately.
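
For reference, it can be captured the same way as your other dumps; a
sketch reusing the same socket path (the output file name is just an
example):

    # dump all sessions with their full internal state into a file
    echo "show sess all" | socat /var/run/haproxy/haproxy1.sock stdio > /tmp/haproxy-sess-all.txt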

By the way, are you using H2 to reach the servers or is it only on
the frontends? I'm trying to figure out which side could be impacted.

Thanks!
Willy

