On Fri, Dec 16, 2022 at 05:42:50PM +0100, Iago Alonso wrote:
> Hi,
> 
> > Ah that's pretty useful :-) It's very likely dealing with the handshake.
> > Could you please run "perf top" on this machine and list the 10 top-most
> > lines ? I'm interested in seeing if you're saturating on crypto functions
> > or locking functions (e.g. "native_queued_spin_lock_slowpath"), that would
> > indicate an abuse of mutexes.
> 
> We are load testing with k6 (https://k6.io/), in a server with an
> AMD Ryzen 5 3600 6-Core Processor (12 threads). We ran a test
> with 300k connections, 15k rps, and a sslrate and connrate of about 4k.
> The server had a `load1` of 12, the `node_nf_conntrack_entries` were about
> 450k. The `haproxy_server_connect_time_average_seconds` was between
> 0.5s and 2s, and the `haproxy_backend_connect_time_average_seconds` was
> always about 0.2s lower.
> 
> We have this custom kernel parameters set:
> net.nf_conntrack_max=5000000
> fs.nr_open=5000000
> net.ipv4.ip_local_port_range=12768    60999

Interesting, so you have conntrack enabled. With a 5M limit, there's no
reason to fill your table. However, have you checked your servers' kernel
logs for any "conntrack table full" messages? Those could be caused by
connections being recycled too quickly.
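
If you want a quick way to check that during a test, something like this
should do (sysctl names assuming a reasonably recent kernel):

  dmesg -T | grep -i "table full"
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

The first one shows the "nf_conntrack: table full, dropping packet"
messages if any were emitted, and the second lets you compare the current
number of tracked entries against the limit while the load is running.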

> At that load, we were starting to see just a small number of 5xx, and
> in the logs
> we could see that they were caused by connect timeout.
> XXX.XXX.XXX.XXX:15636 https-front~ https-back/redacted.worker
> 261/0/-1/-1/10260 503 0 - - sC-- 419484/419480/104950/6557/0 0/0 "GET
> / HTTP/1.1" 0/0000000000000000/0/0/0 TLSv1.3/TLS_AES_256_GCM_SHA384

Hmmm not good. Check if you have any "-m state --state INVALID -j DROP"
rule in your iptables whose counter increases during the tests. It might
be possible that certain outgoing connections are rejected due to fast
port recycling if they were closed by an RST that was lost before hitting
conntrack. Another possibility here is that the TCP connection was
established but that the SSL handshake didn't complete in time. This
could be caused precisely by the openssl 3 locking issues I was speaking
about, but I doubt it at such numbers, and you'd very likely see
native_queued_spin_lock_slowpath in the perf top output.
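
For the counters, something along these lines should be enough to watch
during a test run (adjust to your actual chains/rules; conntrack -S
requires conntrack-tools):

  iptables -vnL | grep -i INVALID    # per-rule packet/byte counters
  conntrack -S                       # per-CPU stats, watch "invalid" and "drop"

If the packet counter on the INVALID->DROP rule climbs while the 503s
appear, that would point to the port recycling issue above rather than
to SSL.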

> This is the output from `perf top`, after 40M samples, during the test
> 
>   17.38%  libc.so.6                   [.] pthread_rwlock_rdlock
>   10.25%  libc.so.6                   [.] pthread_rwlock_unlock

With that said, such heavy rwlock usage indicates that openssl is very
likely the culprit, because it's about the only pthread_rwlock user in
the process.
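
If you want to double-check where those rwlock calls come from, a
call-graph profile should show it (generic perf invocation, adjust the
PID, and you may need debug symbols for openssl):

  perf top -g -p $(pidof -s haproxy)    # or just "perf top -g" system-wide

Expanding pthread_rwlock_rdlock there should reveal OpenSSL 3 callers
such as CRYPTO_THREAD_read_lock.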

To counter this abuse, we've developed a dirty counter-measure that
consists of providing our own rwlock implementation as a replacement for
the pthread one. It's really not beautiful, but we have to deal with what
we have. You can enable it in 2.7 by adding USE_PTHREAD_EMULATION=1 to
your "make" command line. These locks are way less expensive, but they
never sleep (no syscall), so instead of idling when the machine is
congested, you'll see pl_wait_unlock_long() eating CPU in the "perf top"
output. In our tests this roughly doubles the performance under high
congestion, though it's hard to give a precise figure. In essence it
gives us back the locking callbacks that OpenSSL removed in 1.1.0.
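
For reference, the build line would look something like this (adjust
TARGET and the other USE_* options to whatever your current build uses):

  make -j$(nproc) TARGET=linux-glibc USE_OPENSSL=1 USE_PTHREAD_EMULATION=1

Afterwards "haproxy -vv" reports the build options, so you can verify
that the flag was really taken into account.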

> Yes, we are running OpenSSL 3.0.7. We recently upgraded to Ubuntu 20.04,
> and we tested all the possible combinations between HAProxy 2.5.3 and 2.6.6,
> compiled with OpenSSL 1.1.1f and 3.0.7, and we didn't see a noticeable
> performance hit running similar tests as the previously mentioned one. We
> are planning to try again (next week) just stressing the sslrate so we
> can assess a bit more the performance hit.

OK. Please give 2.7.x a try (we hope to have 2.7.1 ready soon with a few
fixes, though these should not impact your tests and 2.7.0 will be OK).
And please check whether USE_PTHREAD_EMULATION=1 improves your situation.
If so, it will definitely confirm that you're endlessly waiting for
threads that are stuck in SSL on a lock.

I would really like us to distinguish L4 from L6 in the log output, to
indicate whether the connection was established at L4 but SSL was not, or
whether it's blocking at L4. The tricky part is that it requires adding
new values for the termination flags, and everyone is very used to the
current ones... But I think we'll have to do it anyway, at least to sort
out such issues.

Regards,
Willy
