On Wed, Apr 23, 2025 at 12:35:51PM +0200, Andreas Bartelt wrote:
> Hi,
> 
> I've noticed that aes-128-gcm performance with scp(1) on amd64 based CPUs is
> much slower than expected on OpenBSD (i.e., I remember throughput being
> significantly better some time ago -- I think I saw much better throughput
> around the time when LRO and TSO were initially enabled for ix(4)). It looks
> to me like AES-NI isn't effectively used anymore.

Right. Thanks for the report. The immediate cause is that ssh relies on
calls to OpenSSL_add_all_algorithms() to initialize libcrypto. However,
the call to OPENSSL_cpuid_setup() was removed from this function
(OPENSSL_add_all_algorithms_noconf()) in c_all.c r1.32, aka

https://github.com/openbsd/src/commit/b2368ebdada0d6d022d20bbe96eab69dbc406e5a

which means that the cpuid probe, which selects an accelerated
implementation when HW support is available, no longer runs.
Coincidentally, this happened about a week after bluhm enabled LRO for
all drivers in:

https://github.com/openbsd/src/commit/3e1926f859efd008e94373bdb5bd5e8d9fb98874

Another bit that will hurt is that ssh switched from aes-128-ctr to
aes-128-gcm by default last December:

https://github.com/openbsd/src/commit/08d45e79c0d607376dd5c42234e36d78473c3ae0

This doesn't make much of a difference in the unaccelerated case:

Without AES-NI:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     174617.32k   211996.90k   693919.98k   754392.03k   775449.26k
aes-128-ctr     185805.70k   216658.12k   778577.33k   888563.84k   915544.45k

but, since our GCM ASM is pretty bad, this will hurt in the accelerated
case. jsing will be looking into improving that since this is also
important for TLS.

With AES-NI:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     669421.74k  1886143.60k  3064423.66k  3495542.89k  3564934.49k
aes-128-ctr     990493.56k  3246635.81k  6959034.82k  9812672.93k 11506436.47k

While we could (and probably should) add OPENSSL_init_crypto() calls to
the various *add_all* APIs, I think a better first fix is the diff
below. With it, the cpuid setup happens whenever a cipher or digest
context is allocated via EVP, so the accelerated implementation should
be chosen if available:

Index: evp/evp_cipher.c
===================================================================
RCS file: /cvs/src/lib/libcrypto/evp/evp_cipher.c,v
diff -u -p -r1.23 evp_cipher.c
--- evp/evp_cipher.c    10 Apr 2024 15:00:38 -0000      1.23
+++ evp/evp_cipher.c    23 Apr 2025 13:52:22 -0000
@@ -614,6 +614,9 @@ LCRYPTO_ALIAS(EVP_DecryptFinal_ex);
 EVP_CIPHER_CTX *
 EVP_CIPHER_CTX_new(void)
 {
+       if (!OPENSSL_init_crypto(0, NULL))
+               return NULL;
+
        return calloc(1, sizeof(EVP_CIPHER_CTX));
 }
 LCRYPTO_ALIAS(EVP_CIPHER_CTX_new);
Index: evp/evp_digest.c
===================================================================
RCS file: /cvs/src/lib/libcrypto/evp/evp_digest.c,v
diff -u -p -r1.14 evp_digest.c
--- evp/evp_digest.c    10 Apr 2024 15:00:38 -0000      1.14
+++ evp/evp_digest.c    23 Apr 2025 13:14:36 -0000
@@ -226,6 +226,9 @@ LCRYPTO_ALIAS(EVP_Digest);
 EVP_MD_CTX *
 EVP_MD_CTX_new(void)
 {
+       if (!OPENSSL_init_crypto(0, NULL))
+               return NULL;
+
        return calloc(1, sizeof(EVP_MD_CTX));
 }
 LCRYPTO_ALIAS(EVP_MD_CTX_new);
