Re: Help tracking "connection refused" under pressure on v2.9

2024-03-27 Thread Felipe Wilhelms Damasio
Hi,

We've confirmed a few findings after deliberately pouring ~75-80Gbps of
traffic at a single machine:
- haproxy does indeed crash;
- hence, we have no stats socket left to collect a few things from;

It seems that under pressure (we're not sure which conditions yet) the
kernel is killing it. dmesg shows:

kernel: traps: haproxy[2057993] trap invalid opcode ip:5b3e26 sp:7fd7c002f100 error:0 in haproxy[42c000+1f7000]
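
In case it helps, here's how we plan to map that instruction pointer back
to a source line, assuming the binary still has its symbols (the
/usr/sbin/haproxy path below is just where ours lives):

# dmesg gives ip:5b3e26 with the executable mapping starting at 42c000.
# For a non-PIE build the absolute address can be resolved directly:
addr2line -f -e /usr/sbin/haproxy 0x5b3e26
# For a PIE build, use the offset into the mapping instead
# (0x5b3e26 - 0x42c000 = 0x187e26):
addr2line -f -e /usr/sbin/haproxy 0x187e26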

This is a relatively new kernel:

Linux ndt-spo-12 6.1.60 #1 SMP PREEMPT_DYNAMIC Wed Oct 25 19:17:36 UTC
2023 x86_64 Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz GenuineIntel
GNU/Linux

And it seems to happen on different kernels.

Does anyone have any tips on how to proceed to track this down?
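
One thing we're going to try on our side is making sure we actually get a
core dump on the next crash; a rough sketch of what we have in mind (the
paths are just examples on our end):

# Remove the core size limit and pick a writable location for the dump:
ulimit -c unlimited
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
# Plus "set-dumpable" in the haproxy global section so the process stays
# dumpable after dropping privileges; if haproxy runs under systemd,
# LimitCORE=infinity in the unit has the same effect as the ulimit.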

Before the crash, "show info" showed only around 27,000 CurConn, so
not a great deal for maxconn 90.
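
We'll also keep a rolling capture of the stats socket between now and the
next crash, so the last samples survive it; something like this (the
socket path is just what our config happens to use):

# Sample "show info" / "show activity" every few seconds into a log file:
while sleep 5; do
    date
    echo "show info"     | socat stdio UNIX-CONNECT:/var/run/haproxy.sock
    echo "show activity" | socat stdio UNIX-CONNECT:/var/run/haproxy.sock
done >> /var/tmp/haproxy-stats.log 2>&1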

Thanks!

On Tue, Mar 26, 2024 at 11:33 PM Felipe Wilhelms Damasio
 wrote:
>
> Hi,
>
> Since we don't really know how to track this one, we thought it might
> be better to reach out here to get feedback.
>
> We're using haproxy to deliver streaming files under pressure
> (80-90Gbps per machine). When using h1/http, splice-response is a
> great help to keep load under control. We use branch v2.9 at the
> moment.
>
> However, we've hit a bug with splice-response (GitHub issue created)
> and we had to run our haproxies all day without splicing.
>
> When we reach a certain load, a "connection refused" alarm starts
> buzzing like crazy (2-3 times every 30 minutes). This alarm is simply
> a connect to localhost with 500ms timeout:
>
> socat /dev/null  tcp4-connect:127.0.0.1:80,connect-timeout=0.5
>
> The log file indicates the port is virtually closed:
>
> 2024/03/27 01:06:04 socat[984480] E read(6, 0xe98000, 8192): Connection refused
>
> The thing is the haproxy process is very much alive... so we just
> restart it every time this happens.
>
> What data do you suggest we collect to help track this down? Not sure
> if the stats socket is available, but we can definitely try and get
> some information.
>
> We're not running out of fds, or even of connections, with or without
> backlog (we have a global maxconn of 90 with ~30,000 streaming sessions
> active, and tcp_max_syn_backlog is set to 262144); we checked. But it
> seems to correlate with heavy traffic.
>
> Thanks!
>
> --
> Felipe Damasio



-- 
Felipe Damasio



Help tracking "connection refused" under pressure on v2.9

2024-03-26 Thread Felipe Wilhelms Damasio
Hi,

Since we don't really know how to track this one, we thought it might
be better to reach out here to get feedback.

We're using haproxy to deliver streaming files under pressure
(80-90Gbps per machine). When using h1/http, splice-response is a
great help to keep load under control. We use branch v2.9 at the
moment.

However, we've hit a bug with splice-response (GitHub issue created)
and we had to run our haproxies all day without splicing.

When we reach a certain load, a "connection refused" alarm starts
buzzing like crazy (2-3 times every 30 minutes). This alarm is simply
a connect to localhost with 500ms timeout:

socat /dev/null  tcp4-connect:127.0.0.1:80,connect-timeout=0.5

The log file indicates the port is virtually closed:

2024/03/27 01:06:04 socat[984480] E read(6, 0xe98000, 8192): Connection refused
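
When the alarm fires we also plan to snapshot the listener itself, to see
whether the accept queue is full or the socket is simply gone; something
along these lines (port 80 is our frontend, the counter names are the
standard nstat ones):

# Recv-Q = connections waiting to be accepted, Send-Q = configured backlog:
ss -ltn '( sport = :80 )'
# Kernel-wide counters for accept-queue overflows/drops since boot:
nstat -az TcpExtListenOverflows TcpExtListenDrops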

The thing is the haproxy process is very much alive... so we just restart
it every time this happens.

What data do you suggest we collect to help track this down? Not sure
if the stats socket is available, but we can definitely try and get
some information.

We're not running out of fds, or even of connections, with or without
backlog (we have a global maxconn of 90 with ~30,000 streaming sessions
active, and tcp_max_syn_backlog is set to 262144); we checked. But it
seems to correlate with heavy traffic.
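
One more thing we want to double-check is the sysctls that decide whether
a full accept queue answers with an RST ("connection refused") or just
drops the attempt; a quick check, for whatever it's worth:

# tcp_abort_on_overflow=1 turns accept-queue overflows into RSTs; the
# effective listen backlog is capped by somaxconn (tcp_max_syn_backlog
# only sizes the SYN queue):
sysctl net.ipv4.tcp_abort_on_overflow net.core.somaxconn net.ipv4.tcp_max_syn_backlog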

Thanks!

-- 
Felipe Damasio