On Wed, Apr 23, 2025 at 12:35:51PM +0200, Andreas Bartelt wrote:
> Hi,
>
> I've noticed that aes-128-gcm performance with scp(1) on amd64 based CPUs is
> much slower than expected on OpenBSD (i.e., I remember throughput being
> significantly better some time ago -- I think I saw much better throughput
> around the time when LRO and TSO were initially enabled for ix(4)). It looks
> to me like AES-NI isn't effectively used anymore.
Right. Thanks for the report.

The immediate reason for this is that ssh relies on calls to
OpenSSL_add_all_algorithms() to initialize libcrypto. However, the call to
OPENSSL_cpuid_setup() was removed from this function
(OPENSSL_add_all_algorithms_noconf()) in c_all.c r1.32, aka
https://github.com/openbsd/src/commit/b2368ebdada0d6d022d20bbe96eab69dbc406e5a
which means that the cpuid probe that selects an accelerated implementation
when hardware support is available is no longer run. This coincidentally
happened about a week after LRO was enabled by bluhm for all drivers in:
https://github.com/openbsd/src/commit/3e1926f859efd008e94373bdb5bd5e8d9fb98874

Another bit that will hurt is that ssh switched from aes-128-ctr to
aes-128-gcm by default last December:
https://github.com/openbsd/src/commit/08d45e79c0d607376dd5c42234e36d78473c3ae0

This doesn't make much of a difference in the unaccelerated case:

Without AES-NI:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     174617.32k   211996.90k   693919.98k   754392.03k   775449.26k
aes-128-ctr     185805.70k   216658.12k   778577.33k   888563.84k   915544.45k

but, since our GCM ASM is pretty bad, it will hurt in the accelerated case.
jsing will be looking into improving that, since this is also important
for TLS.
With AES-NI:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm     669421.74k  1886143.60k  3064423.66k  3495542.89k  3564934.49k
aes-128-ctr     990493.56k  3246635.81k  6959034.82k  9812672.93k 11506436.47k

While we could (and probably should) add OPENSSL_init_crypto() calls to the
various *add_all* APIs, I think a better first fix is the diff below. It makes
the cpuid setup happen whenever a cipher or a digest is invoked via EVP, so
the accelerated implementation is chosen if available:

Index: evp/evp_cipher.c
===================================================================
RCS file: /cvs/src/lib/libcrypto/evp/evp_cipher.c,v
diff -u -p -r1.23 evp_cipher.c
--- evp/evp_cipher.c	10 Apr 2024 15:00:38 -0000	1.23
+++ evp/evp_cipher.c	23 Apr 2025 13:52:22 -0000
@@ -614,6 +614,9 @@ LCRYPTO_ALIAS(EVP_DecryptFinal_ex);
 EVP_CIPHER_CTX *
 EVP_CIPHER_CTX_new(void)
 {
+	if (!OPENSSL_init_crypto(0, NULL))
+		return NULL;
+
 	return calloc(1, sizeof(EVP_CIPHER_CTX));
 }
 LCRYPTO_ALIAS(EVP_CIPHER_CTX_new);

Index: evp/evp_digest.c
===================================================================
RCS file: /cvs/src/lib/libcrypto/evp/evp_digest.c,v
diff -u -p -r1.14 evp_digest.c
--- evp/evp_digest.c	10 Apr 2024 15:00:38 -0000	1.14
+++ evp/evp_digest.c	23 Apr 2025 13:14:36 -0000
@@ -226,6 +226,9 @@ LCRYPTO_ALIAS(EVP_Digest);
 EVP_MD_CTX *
 EVP_MD_CTX_new(void)
 {
+	if (!OPENSSL_init_crypto(0, NULL))
+		return NULL;
+
 	return calloc(1, sizeof(EVP_MD_CTX));
 }
 LCRYPTO_ALIAS(EVP_MD_CTX_new);