Hi Shawn,

On Fri, May 26, 2023 at 11:17:15PM -0600, Shawn Heisey wrote:
> On 5/25/23 09:08, Willy Tarreau wrote:
> > The problem definitely is concurrency, so 1000 curl will show nothing
> > and will not even match production traffic. You'll need to use a load
> > generator that allows you to tweak the TLS resume support, like we do
> > with h1load's argument "--tls-reuse". Also I don't know how often the
> > recently modified locks are used per server connection and per client
> > connection, that's what the SSL guys want to know since they're not able
> > to test their changes.
> 
> I finally got a test program together.  After trying and failing with the
> Jetty HttpClient and Apache HttpClient version 5 (both options that would
> have let me do HTTP/2) I got a program together with Apache HttpClient
> version 4.  I had one version that shelled out to curl, but it ran about ten
> times slower.
> 
> I know lots of people are going to have bad things to say about writing a
> test in Java.  It's the only language where I already know how to write
> multi-threaded code.

:-)

> I would have to spend a bunch of time learning how to
> do that in another language.

For h2, h2load is available, but it doesn't allow you to close and
re-open connections.

> It fires up X threads, each of which make 1000 consecutive requests to the
> URL specified.  It records the time in milliseconds for each request, and
> when all the threads finish, prints out statistics.  These runs are with 24
> threads.  I ran it on a different system so that it would not affect CPU
> usage on the server running haproxy.  Here's the results:
> 
> quictls branch: OpenSSL_1_1_1t+quic
> 23:01:19.067 [main] INFO  o.e.t.h.MainSSLTest Count 24000 1228.69/s
> 23:01:19.069 [main] INFO  o.e.t.h.MainSSLTest Median 7562839 ns
> 23:01:19.069 [main] INFO  o.e.t.h.MainSSLTest 75th % 25138492 ns
> 23:01:19.070 [main] INFO  o.e.t.h.MainSSLTest 95th % 70603313 ns
> 23:01:19.070 [main] INFO  o.e.t.h.MainSSLTest 99th % 120502022 ns
> 23:01:19.070 [main] INFO  o.e.t.h.MainSSLTest 99.9 % 355829439 ns
> 
> quictls branch: openssl-3.1.0+quic+locks
> 22:56:11.457 [main] INFO  o.e.t.h.MainSSLTest Count 24000 1267.96/s
> 22:56:11.459 [main] INFO  o.e.t.h.MainSSLTest Median 6827111 ns
> 22:56:11.459 [main] INFO  o.e.t.h.MainSSLTest 75th % 23239248 ns
> 22:56:11.460 [main] INFO  o.e.t.h.MainSSLTest 95th % 70625628 ns
> 22:56:11.460 [main] INFO  o.e.t.h.MainSSLTest 99th % 129494323 ns
> 22:56:11.460 [main] INFO  o.e.t.h.MainSSLTest 99.9 % 307070582 ns
> 
> quictls branch: openssl-3.0.8+quic
> 22:59:12.614 [main] INFO  o.e.t.h.MainSSLTest Count 24000 1163.24/s
> 22:59:12.616 [main] INFO  o.e.t.h.MainSSLTest Median 6930268 ns
> 22:59:12.616 [main] INFO  o.e.t.h.MainSSLTest 75th % 26238752 ns
> 22:59:12.616 [main] INFO  o.e.t.h.MainSSLTest 95th % 75464869 ns
> 22:59:12.616 [main] INFO  o.e.t.h.MainSSLTest 99th % 132522508 ns
> 22:59:12.617 [main] INFO  o.e.t.h.MainSSLTest 99.9 % 445411125 ns
> 
> The stats don't show any kind of smoking gun like I had hoped they would.
> Not a lot of difference there.
> 
> Differences in the requests per second are also not huge, but more in line
> with what I was expecting.  If I can believe those numbers, and I admit that
> this kind of micro-benchmark is not the most reliable way to test
> performance, it looks like 3.1.0 with the lock fixes is slightly faster than
> 1.1.1t. 24 threads might not be enough to really exercise the concurrency
> though.
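
For reference, a minimal sketch of that kind of threaded test client
could look like the code below. This is hypothetical code, not the
actual program: the class name, the placeholder URL, the use of Apache
HttpClient 4 defaults and the naive percentile math are assumptions,
not details taken from the real test.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class SslLoadSketch {
        static final int THREADS = 24;        // number of worker threads
        static final int REQUESTS = 1000;     // consecutive requests per thread
        static final String URL = "https://lb.example.com/";  // placeholder target

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            List<Future<List<Long>>> futures = new ArrayList<>();
            long start = System.nanoTime();

            for (int t = 0; t < THREADS; t++) {
                futures.add(pool.submit(() -> {
                    List<Long> samples = new ArrayList<>(REQUESTS);
                    // one client per thread; the default settings keep the
                    // connection alive between consecutive requests
                    try (CloseableHttpClient client = HttpClients.createDefault()) {
                        for (int i = 0; i < REQUESTS; i++) {
                            long t0 = System.nanoTime();
                            try (CloseableHttpResponse rsp = client.execute(new HttpGet(URL))) {
                                EntityUtils.consume(rsp.getEntity());  // drain the body
                            }
                            samples.add(System.nanoTime() - t0);
                        }
                    }
                    return samples;
                }));
            }

            List<Long> all = new ArrayList<>(THREADS * REQUESTS);
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get());          // wait for every worker
            }
            pool.shutdown();

            double seconds = (System.nanoTime() - start) / 1e9;
            Collections.sort(all);
            System.out.printf("Count %d %.2f/s%n", all.size(), all.size() / seconds);
            for (double p : new double[] {50, 75, 95, 99, 99.9}) {
                int idx = (int) Math.ceil(p / 100.0 * all.size()) - 1;
                System.out.printf("%4.1f%% %d ns%n", p, all.get(Math.max(idx, 0)));
            }
        }
    }

Note that with HttpClients.createDefault() each worker keeps its
connection alive between consecutive requests, which matters for the
point below.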

The small difference makes me think your requests were all sent over
keep-alive connections, which is fine, but which no longer stresses
the TLS stack. Those suffering from TLS performance problems are those
handling many connections, where merely resuming a TLS session (and
even more so creating a new one) takes a lot of time. If all your
requests pass over already-established connections, the TLS stack has
essentially nothing left to do; it's just trivial AES crypto that
comes almost for free nowadays.
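
If the client is built on Apache HttpClient 4, one way to make every
request open a fresh TCP+TLS connection is to disable connection
reuse. This is only a sketch of one possible tweak (the class name and
URL handling are placeholders), not a description of how the existing
test actually works:

    import java.io.IOException;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.NoConnectionReuseStrategy;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class NoKeepAliveClient {
        // Build a client that closes the connection after every response,
        // so each request costs a fresh TCP connect plus a TLS handshake.
        static CloseableHttpClient newClient() {
            return HttpClients.custom()
                    .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
                    .build();
        }

        public static void main(String[] args) throws IOException {
            String url = args.length > 0 ? args[0] : "https://lb.example.com/"; // placeholder
            try (CloseableHttpClient client = newClient();
                 CloseableHttpResponse rsp = client.execute(new HttpGet(url))) {
                EntityUtils.consume(rsp.getEntity());   // drain and release
                System.out.println(rsp.getStatusLine());
            }
        }
    }

With the JVM's default client-side session cache these new connections
will mostly perform resumed handshakes; building a fresh SSLContext
(and client) per request would make them full handshakes instead,
since the JSSE session cache is per-SSLContext.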

I have updated the ticket there with my measurements. With 24 cores
I didn't measure a big difference in the new-session rate, since the
CPU was dominated by asymmetric crypto (27.4k for 3.1 vs 30.5k for
1.1.1 and 35k for wolfSSL). However, with resumed connections the
difference was more visible: 48.5k for 3.1, 49.9k for 3.1+locks, 106k
for 1.1.1 and 124k for wolfSSL. And in that case there isn't much
contention (around 15% of CPU lost waiting for a lock), which tends to
indicate that it's mainly the excess usage of locks (even uncontended
ones) or atomic ops that divides the performance by 2 to 2.5.
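
As a rough, generic illustration of that last point (this is not
OpenSSL code, and a serious measurement would use a harness like JMH
rather than a hand-rolled loop), even a completely uncontended lock
or atomic operation is noticeably more expensive than the plain
operation it protects:

    import java.util.concurrent.atomic.AtomicLong;

    // Crude single-threaded comparison of a plain increment, an atomic
    // increment and an uncontended synchronized increment.
    public class UncontendedCost {
        static long plain;
        static long locked;
        static final AtomicLong atomic = new AtomicLong();
        static final Object lock = new Object();

        public static void main(String[] args) {
            final int n = 50_000_000;
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) plain++;
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) atomic.incrementAndGet();
            long t2 = System.nanoTime();
            for (int i = 0; i < n; i++) { synchronized (lock) { locked++; } }
            long t3 = System.nanoTime();
            System.out.printf("plain  %4d ms%natomic %4d ms%nlocked %4d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
        }
    }

The absolute numbers mean little because the JIT may simplify the
plain loop, but the ordering plain < atomic <= synchronized generally
holds even on a single thread.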

For some users those numbers mean that if they currently need 4 LBs to
stay under 80% load with 1.1.1, they will need 8-9 with 3.1 under the
same conditions (the resumed-handshake rate above drops by roughly
106k / 48.5k = 2.2, and 4 x 2.2 is about 9).

Another point that I didn't measure there (because it's always a pain
to do) is client mode, which is much more affected. It's less dramatic
in 3.1 than in 3.0, but still heavily impacted. This will hurt
re-encrypted communication between haproxy and the origin servers,
which is common in cloud environments.

Willy