Hi,

We've confirmed a few findings after deliberately pouring ~75-80Gbps of
traffic at a single machine:
- haproxy does indeed crash;
- hence, we have no stats socket to collect anything from;

It seems that under pressure (we're not sure of the exact conditions
yet) the kernel is killing it. dmesg shows:

kernel: traps: haproxy[2057993] trap invalid opcode ip:5b3e26
sp:7fd7c002f100 error:0 in haproxy[42c000+1f7000]
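
We'd like to map that instruction pointer back to a source line. A
rough sketch of what we plan to try, assuming /usr/sbin/haproxy is our
non-stripped binary (the address comes from the dmesg line above):

# resolve the faulting ip against the binary (non-PIE build assumed);
# for a PIE build we'd use the offset 0x5b3e26 - 0x42c000 instead
addr2line -f -e /usr/sbin/haproxy 0x5b3e26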

This is a relatively new kernel:

Linux ndt-spo-12 6.1.60 #1 SMP PREEMPT_DYNAMIC Wed Oct 25 19:17:36 UTC
2023 x86_64 Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz GenuineIntel
GNU/Linux

And it seems to happen on different kernels.

Does anyone have any tips on how to proceed to track this down?

Before the crash, "show info" showed only around 27,000 CurConn, so
nowhere near our global maxconn of 900000.
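
In the meantime we're setting things up to capture a core dump so we
can post a full backtrace the next time it dies. A rough sketch of
what we have in mind (paths are only examples, and "set-dumpable" in
the global section should keep dumps working after privileges are
dropped):

# allow core dumps (LimitCORE=infinity in the unit file if run under systemd)
ulimit -c unlimited
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
# plus, in haproxy.cfg:
#   global
#       set-dumpable
# then, after the next crash:
gdb -batch -ex 'thread apply all bt full' /usr/sbin/haproxy /var/crash/core.haproxy.<pid>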

Thanks!

On Tue, Mar 26, 2024 at 11:33 PM Felipe Wilhelms Damasio
<felip...@taghos.com.br> wrote:
>
> Hi,
>
> Since we don't really know how to track this one, we thought it might
> be better to reach out here to get feedback.
>
> We're using haproxy to deliver streaming files under pressure
> (80-90Gbps per machine). When using h1/http, splice-response is a
> great help to keep load under control. We use branch v2.9 at the
> moment.
>
> However, we've hit a bug with splice-response (GitHub issue created),
> so we've had to run our haproxies all day without splicing.
>
> When we reach a certain load, a "connection refused" alarm starts
> buzzing like crazy (2-3 times every 30 minutes). This alarm is simply
> a connect to localhost with 500ms timeout:
>
> socat /dev/null  tcp4-connect:127.0.0.1:80,connect-timeout=0.5
>
> The log file indicates the port is virtually closed:
>
> 2024/03/27 01:06:04 socat[984480] E read(6, 0xe98000, 8192): Connection refused
>
> The thing is, the haproxy process is very much alive... so we just
> restart it every time this happens.
>
> What data do you suggest we collect to help track this down? Not sure
> if the stats socket is available, but we can definitely try and get
> some information.
>
> We've checked, and we're not running out of fds, or even of
> connections with/without backlog (we have a global maxconn of 900000
> with ~30,000 streaming sessions active, and tcp_max_syn_backlog is
> set to 262144). But it seems to correlate with heavy traffic.
>
> Thanks!
>
> --
> Felipe Damasio



-- 
Felipe Damasio
