Hi,

We are not sure what element produces the errors, in the haproxy logs we don't see them. What does it happen with the new connections when we hit the limit? What server resource should be affected by it, if any?

We have our logs in `warning` level so we do not see the response time on the haproxy logs. In the `haproxy_backend_response_time_average_seconds` we see "normal" values (20ms average and some spikes at less than 80ms).

We see a spike in the `haproxy_backend_connect_time_average_seconds`, as well as in the `haproxy_server_connect_time_average_seconds` metrics when the errors start. We have more haproxies (that we were not stressing in the test) in front of the same backends, serving without issues. We are using https between to communicate to the backends too, and we see that the task consuming the majority of the resources is `ssl_sock_io_cb`.

Our traffic is not request rate intensive, but we do have a high amount of concurrent connections. For example, at 200k connections, we have 10k rps, and at 300k connections, we have 14k rps.

In our tests, we never saturate our network bandwidth, we hit as much as 50% our available bandwidth for the server.

Regards.


On 14/12/22 8:09, Willy Tarreau wrote:
Hi,

On Tue, Dec 13, 2022 at 03:33:58PM +0100, Iago Alonso wrote:
Hi,

We do hit our defined max ssl/conn rates, but given the spare
resources, we don't expect to suddenly return 5xx.
What bothers me is that once this limit is reached there's no more
connection accepted by haproxy so you should indeed not see such
errors. What element produces these errors ? Since you've enabled
logging, can you check them in your logs and figure whether they're
sent by the server, by haproxy or none (it might well be the load
generator translating connection timeouts to 5xx for user reporting).

If the errors are produced by haproxy, then their exact value and the
termination flags are very important as they'll indicate where the
problem is.

Another thing you'll observe in your logs are the server's response
time. It's possible that your request rate is getting very close to
the server's limits.

Among other limitations the packet rate and network bandwidth might
represent a limit. For example, let's say you're requesting 10kB
objects. At 10k/s it's just less than a gigabit/s, at 11k/s it doesn't
pass anymore.

When you look at the haproxy stats page, the cumulated live network
bandwidth is reported at the top, it might give you an indication of
a possible limit.

But in any case, stats on the status codes and termination codes in
logs would be extremely useful. Note that you can do this natively
using halog (I seem to remember it's halog -st and halog -tc, but
please double-check, and in any case, cut|sort|uniq -c always works).

Regards,
Willy

Reply via email to