Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms
Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:

On Fri, Jan 22, 2021 at 11:29 PM Megha Dey wrote:

Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ (first implemented on Intel's Icelake client and Xeon CPUs). These algorithms take advantage of the AVX512 registers to keep the CPU busy and increase memory bandwidth utilization. They provide substantial (2-10x) improvements over existing crypto algorithms when the update data size is greater than 128 bytes and do not have any significant impact when used on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause collateral damage to other workloads running on the same core (co-scheduled threads). These frequency drops are also known as bin drops, where 1 bin drop is around 100 MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin drop (0-100 MHz) is observed on the Icelake desktop and 0-2 bin drops (0-200 MHz) are observed on the Icelake server.

The AVX512 optimizations are disabled by default to avoid impact on other workloads. In order to use these optimized algorithms:

1. At compile time:
   a. User must enable the CONFIG_CRYPTO_AVX512 option
   b. Toolchain (assembler) must support VPCLMULQDQ and VAES instructions

2. At run time:
   a. User must set the module parameter use_avx512 at boot time
   b. Platform must support the VPCLMULQDQ and VAES features

N.B. It is unclear whether these coarse-grained controls (a global module parameter) would meet all user needs. Perhaps some per-thread control might be useful? Looking for guidance here.

I've just been looking at some performance issues with in-kernel AVX, and I have a whole pile of questions that I think should be answered first:

What is the impact of using an AVX-512 instruction on the logical thread, its siblings, and other cores on the package? Does the impact depend on whether it's a 512-bit insn or a shorter EVEX insn? What is the impact on subsequent shorter EVEX, VEX, and legacy SSE(2,3, etc) insns? How does VZEROUPPER figure in? I can find an enormous amount of misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being "in use"? As far as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR and its variants. Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a non-VEX insn followed by a large-width EVEX insn without an intervening VZEROUPPER? The docs suggest no, since Broadwell and before don't support EVEX, but I'd like to know for sure.

My current opinion is that we should not enable AVX-512 in-kernel except on CPUs that we determine have good AVX-512 support. Based on some reading, that seems to mean Ice Lake Client and not anything before it. I also think a bunch of the above questions should be answered before we do any of this. Right now we have a regression of unknown impact in regular AVX support in-kernel, we will have performance issues in-kernel depending on what user code has done recently, and I'm still trying to figure out what to do about it. Throwing AVX-512 into the mix without real information is not going to improve the situation.

We are currently working on providing you with answers to the questions you have raised regarding AVX.

Thanks,
Megha
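Taken together, the compile-time and run-time gating described in the cover letter amounts to roughly the following sketch. The helper name crypto_avx512_usable() is hypothetical; the actual patches wire the individual checks into each glue module rather than a shared function:

	#include <linux/module.h>
	#include <asm/cpufeature.h>

	static bool use_avx512;
	module_param(use_avx512, bool, 0644);
	MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");

	/* Hypothetical helper: true only when the kernel was built with the
	 * AVX512 code, the user opted in via the module parameter, and the CPU
	 * advertises the required ISA extensions. */
	static bool crypto_avx512_usable(void)
	{
		return IS_ENABLED(CONFIG_CRYPTO_AVX512) && use_avx512 &&
		       boot_cpu_has(X86_FEATURE_VAES) &&
		       boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
		       boot_cpu_has(X86_FEATURE_AVX512VL);
	}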
Re: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ
Hi Ard,

On 1/16/2021 9:16 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:08, Megha Dey wrote:

Introduce the AVX512 implementation that optimizes the AESNI-GCM encode and decode routines using VPCLMULQDQ.

The glue code in the AESNI module overrides the existing AVX2 GCM mode encryption/decryption routines with the AVX512 AES GCM mode ones when the following criteria are met:

At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions

At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently only Icelake)
2. aesni_intel.use_avx512 module parameter is set at boot time.

For this algorithm, switching away from the AVX512 optimized version is not possible once set at boot time because of how the code is structured today. (Can be changed later if required.)

The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512, aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted from the Intel Optimized IPSEC Cryptographic library.

On an Icelake desktop, with turbo disabled and all CPUs running at maximum frequency, the AVX512 GCM mode optimization shows better performance across data and key sizes as measured by tcrypt.

The average performance improvement of the AVX512 version over the AVX2 version is as follows:
For all key sizes (128/192/256 bits):
  - data sizes < 128 bytes/block: negligible improvement (~7.5%)
  - data sizes > 128 bytes/block: an average improvement of 40% for both encryption and decryption.

A typical run of tcrypt with AES GCM mode encryption/decryption of the AVX2 and AVX512 optimizations on an Icelake desktop shows the following results:

 ------------------------------------------------------------------------------
| key    | bytes |   cycles/op (lower is better)   | Percentage gain/loss      |
| length | per   |  encryption    |  decryption    |                           |
| (bits) | block | avx2 | avx512  | avx2 | avx512  |  Encrypt  |  Decrypt      |
|--------|-------|------|---------|------|---------|-----------|---------------|
| 128    | 16    | 689  | 701     | 689  | 707     | -1.7      | -2.61         |
| 128    | 64    | 731  | 660     | 771  | 649     | 9.7       | 15.82         |
| 128    | 256   | 911  | 750     | 900  | 721     | 17.67     | 19.88         |
| 128    | 512   | 1181 | 814     | 1161 | 782     | 31.07     | 32.64         |
| 128    | 1024  | 1676 | 1052    | 1685 | 1030    | 37.23     | 38.87         |
| 128    | 2048  | 2475 | 1447    | 2456 | 1419    | 41.53     | 42.22         |
| 128    | 4096  | 3806 | 2154    | 3820 | 2119    | 43.41     | 44.53         |
| 128    | 8192  | 9169 | 3806    | 6997 | 3718    | 58.49     | 46.86         |
| 192    | 16    | 754  | 683     | 737  | 672     | 9.42      | 8.82          |
| 192    | 64    | 735  | 686     | 715  | 640     | 6.66      | 10.49         |
| 192    | 256   | 949  | 738     | 2435 | 729     | 22.23     | 70            |
| 192    | 512   | 1235 | 854     | 1200 | 833     | 30.85     | 30.58         |
| 192    | 1024  | 1777 | 1084    | 1763 | 1051    | 38.99     | 40.39         |
| 192    | 2048  | 2574 | 1497    | 2592 | 1459    | 41.84     | 43.71         |
| 192    | 4096  | 4086 | 2317    | 4091 | 2244    | 43.29     | 45.14         |
| 192    | 8192  | 7481 | 4054    | 7505 | 3953    | 45.81     | 47.32         |
| 256    | 16    | 755  | 682     | 720  | 683     | 9.68      | 5.14          |
| 256    | 64    | 744  | 677     | 719  | 658     | 9         | 8.48          |
| 256    | 256   | 962  | 758     | 948  | 749     | 21.21     | 21            |
| 256    | 512   | 1297 | 862     | 1276 | 836     | 33.54     | 34.48         |
| 256    | 1024  | 1831 | 1114    | 1819 | 1095    | 39.16     | 39.8          |
| 256    | 2048  | 2767 | 1566    | 2715 | 1524    | 43.4      | 43.87         |
| 256    | 4096  | 4378 | 2382    | 4368 | 2354    | 45.6      | 46.11         |
| 256    | 8192  | 8075 | 4262    | 8080 | 4186    | 47.22     | 48.19         |
 ------------------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki
Signed-off-by: Tomasz Kantecki
Signed-off-by: Megha Dey
---
 arch/x86/crypto/Makefile                    |    1 +
 arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++
 arch/x86/crypto/aesni-intel_glue.c          |   62 +-
 crypto/Kconfig                              |   12 +
 4 files changed, 1858 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S

...

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 9e56cdf..8fc5bac 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
  * This needs to be 16 b
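For orientation, the boot-time override described in the commit message could be pictured as the sketch below. Only the routine names come from the patch; the argument lists, the ops structure, its fields and the selection hook are illustrative stand-ins for whatever dispatch the real glue code uses:

	/* Illustrative argument lists -- not the real ABI of the assembly routines. */
	asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx, void *data, u8 *out,
						     const u8 *in, unsigned long len);
	asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx, void *data, u8 *out,
						     const u8 *in, unsigned long len);

	/* Hypothetical dispatch table, modeled on the existing AVX/AVX2 selection. */
	struct aesni_gcm_ops {
		void (*enc_update)(void *ctx, void *data, u8 *out, const u8 *in,
				   unsigned long len);
		void (*dec_update)(void *ctx, void *data, u8 *out, const u8 *in,
				   unsigned long len);
	};

	static const struct aesni_gcm_ops gcm_ops_avx_512 = {
		.enc_update = aesni_gcm_enc_update_avx_512,
		.dec_update = aesni_gcm_dec_update_avx_512,
	};

	static const struct aesni_gcm_ops *gcm_ops; /* AVX2 table by default */

	static void __init aesni_gcm_pick_avx512(void)
	{
		if (IS_ENABLED(CONFIG_CRYPTO_AVX512) && use_avx512 &&
		    boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
		    boot_cpu_has(X86_FEATURE_AVX512VL))
			gcm_ops = &gcm_ops_avx_512; /* one-way switch, per the note above */
	}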
Re: [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPCLMULQDQ support
Hi Ard,

On 1/16/2021 8:54 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey wrote:

This is a preparatory patch to introduce the optimized crypto algorithms using AVX512 instructions, which would require VAES and VPCLMULQDQ support. Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.

Cc: x...@kernel.org
Signed-off-by: Megha Dey
---
 arch/x86/Kconfig.assembler | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08..9ea0bc8 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -1,6 +1,16 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (C) 2020 Jason A. Donenfeld . All Rights Reserved.
+config AS_VAES_AVX512
+	def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT

Is the '&& 64BIT' necessary here, but not below? In any case, better to use a separate 'depends on' line, for legibility.

Yeah, I think the '&& 64BIT' is not required. I will remove it in the next version.

-Megha

+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_VPCLMULQDQ
+	def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
 config AS_AVX512
 	def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
 	help
--
2.7.4
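If the 64BIT restriction were kept, the separate 'depends on' line Ard asks for would look roughly like this (a sketch of the Kconfig shape, not the final patch):

	config AS_VAES_AVX512
		def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1)
		depends on 64BIT
		help
		  Supported by binutils >= 2.30 and LLVM integrated assembler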
Re: [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
Hi Ard,

On 1/16/2021 9:03 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:08, Megha Dey wrote:

Introduce the "by16" implementation of the AES CTR mode using AVX512 optimizations. "by16" means that 16 independent blocks (each block being 128 bits) can be ciphered simultaneously, as opposed to the current 8 blocks.

The glue code in the AESNI module overrides the existing "by8" CTR mode encryption/decryption routines with the "by16" ones when the following criteria are met:

At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VAES instructions

At runtime:
1. VAES and AVX512VL features are supported on the platform (currently only Icelake)
2. aesni_intel.use_avx512 module parameter is set at boot time.

For this algorithm, switching away from the AVX512 optimized version is not possible once set at boot time because of how the code is structured today. (Can be changed later if required.)

The functions aes_ctr_enc_128_avx512_by16(), aes_ctr_enc_192_avx512_by16() and aes_ctr_enc_256_avx512_by16() are adapted from the Intel Optimized IPSEC Cryptographic library.

On an Icelake desktop, with turbo disabled and all CPUs running at maximum frequency, the "by16" CTR mode optimization shows better performance across data and key sizes as measured by tcrypt.

The average performance improvement of the "by16" version over the "by8" version is as follows:
For all key sizes (128/192/256 bits):
  - data sizes < 128 bytes/block: a negligible difference (~3% loss)
  - data sizes > 128 bytes/block: an average improvement of 48% for both encryption and decryption.

A typical run of tcrypt with AES CTR mode encryption/decryption of the "by8" and "by16" optimizations on an Icelake desktop shows the following results:

 ----------------------------------------------------------------------
| key    | bytes |  cycles/op (lower is better)  | percentage          |
| length | per   |  encryption   |  decryption   | loss/gain           |
| (bits) | block | by8  | by16   | by8  | by16   | enc     | dec       |
|--------|-------|------|--------|------|--------|---------|-----------|
| 128    | 16    | 156  | 168    | 164  | 168    | -7.7    | -2.5      |
| 128    | 64    | 180  | 190    | 157  | 146    | -5.6    | 7.1       |
| 128    | 256   | 248  | 158    | 251  | 161    | 36.3    | 35.9      |
| 128    | 1024  | 633  | 316    | 642  | 319    | 50.1    | 50.4      |
| 128    | 1472  | 853  | 411    | 877  | 407    | 51.9    | 53.6      |
| 128    | 8192  | 4463 | 1959   | 4447 | 1940   | 56.2    | 56.4      |
| 192    | 16    | 136  | 145    | 149  | 166    | -6.7    | -11.5     |
| 192    | 64    | 159  | 154    | 157  | 160    | 3.2     | -2        |
| 192    | 256   | 268  | 172    | 274  | 177    | 35.9    | 35.5      |
| 192    | 1024  | 710  | 358    | 720  | 355    | 49.6    | 50.7      |
| 192    | 1472  | 989  | 468    | 983  | 469    | 52.7    | 52.3      |
| 192    | 8192  | 6326 | 3551   | 6301 | 3567   | 43.9    | 43.4      |
| 256    | 16    | 153  | 165    | 139  | 156    | -7.9    | -12.3     |
| 256    | 64    | 158  | 152    | 174  | 161    | 3.8     | 7.5       |
| 256    | 256   | 283  | 176    | 287  | 202    | 37.9    | 29.7      |
| 256    | 1024  | 797  | 393    | 807  | 395    | 50.7    | 51.1      |
| 256    | 1472  | 1108 | 534    | 1107 | 527    | 51.9    | 52.4      |
| 256    | 8192  | 5763 | 2616   | 5773 | 2617   | 54.7    | 54.7      |
 ----------------------------------------------------------------------

This work was inspired by the AES CTR mode optimization published in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/cntr_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki
Signed-off-by: Tomasz Kantecki
Signed-off-by: Megha Dey
---
 arch/x86/crypto/Makefile                    |   1 +
 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S | 856
 arch/x86/crypto/aesni-intel_glue.c          |  57 +-
 arch/x86/crypto/avx512_vaes_common.S        | 422 ++
 arch/x86/include/asm/disabled-features.h    |   8 +-
 crypto/Kconfig                              |  12 +
 6 files changed, 1354 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S

...
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ad8a718..f45059e 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -46,6 +46,10 @@
 #define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
 #define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)

+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
+
 /* This data is stored at the end of the crypto_tfm struct.
  * It's a type of per "session" data storage location.
  * This needs to be 16 byte aligned.
@@ -191,6 +195,35 @@ asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv, void *keys,
 					u8 *out, unsigned int num_bytes);
 asmlinkage void aes_ctr_enc_256_avx
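To make the "by16" override concrete, here is a sketch of how the new entry points might be dispatched by key length, mirroring the existing "by8" selection. The prototypes copy the by8 ones quoted in the diff above and should be treated as illustrative, not as the final glue code:

	#include <linux/linkage.h>
	#include <linux/types.h>
	#include <crypto/aes.h>

	asmlinkage void aes_ctr_enc_128_avx512_by16(const u8 *in, u8 *iv, void *keys,
						    u8 *out, unsigned int num_bytes);
	asmlinkage void aes_ctr_enc_192_avx512_by16(const u8 *in, u8 *iv, void *keys,
						    u8 *out, unsigned int num_bytes);
	asmlinkage void aes_ctr_enc_256_avx512_by16(const u8 *in, u8 *iv, void *keys,
						    u8 *out, unsigned int num_bytes);

	/* Sketch of a by16 dispatcher, analogous to the existing by8 one. */
	static void aesni_ctr_enc_avx512_tfm(struct crypto_aes_ctx *ctx, u8 *out,
					     const u8 *in, unsigned int len, u8 *iv)
	{
		if (ctx->key_length == AES_KEYSIZE_128)
			aes_ctr_enc_128_avx512_by16(in, iv, ctx->key_enc, out, len);
		else if (ctx->key_length == AES_KEYSIZE_192)
			aes_ctr_enc_192_avx512_by16(in, iv, ctx->key_enc, out, len);
		else
			aes_ctr_enc_256_avx512_by16(in, iv, ctx->key_enc, out, len);
	}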
Re: [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction
Hi Ard,

On 1/16/2021 9:00 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey wrote:

From: Kyung Min Park

Update the crc_pcl function that calculates the T10 Data Integrity Field CRC16 (CRC T10 DIF) using the VPCLMULQDQ instruction. The VPCLMULQDQ instruction with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. The advantage comes from packing multiples of 4 x 128-bit data blocks into AVX512 registers, reducing instruction latency.

The glue code in the crct10dif module overrides the existing PCLMULQDQ version with the VPCLMULQDQ version when the following criteria are met:

At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions

At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently only Icelake)
2. If compiled as a built-in module, crct10dif_pclmul.use_avx512 is set at boot time or /sys/module/crct10dif_pclmul/parameters/use_avx512 is set to 1 after boot. If compiled as a loadable module, the use_avx512 module parameter must be set: modprobe crct10dif_pclmul use_avx512=1

A typical run of tcrypt with CRC T10 DIF calculation with the PCLMULQDQ instruction and the VPCLMULQDQ instruction shows the following results:
For bytes per update >= 1KB, we see an average improvement of 46% (~1.4x).
For bytes per update < 1KB, we see an average improvement of 13%.
The test was performed on an Icelake based platform with constant frequency set for the CPU. Detailed results for a variety of block sizes and update sizes are in the table below.

 ---------------------------------------------------------------------------
|        |            |        cycles/operation           |            |
|        |            |      (the lower the better)       |            |
| byte   | bytes per  |-----------------------------------| percentage |
| blocks | update     | CRC T10 DIF     | CRC T10 DIF     | loss/gain  |
|        |            | with PCLMULQDQ  | with VPCLMULQDQ |            |
|--------|------------|-----------------|-----------------|------------|
| 16     | 16         | 77              | 106             | -27.0      |
| 64     | 16         | 411             | 390             | 5.4        |
| 64     | 64         | 71              | 85              | -16.0      |
| 256    | 16         | 1224            | 1308            | -6.4       |
| 256    | 64         | 393             | 407             | -3.4       |
| 256    | 256        | 93              | 86              | 8.1        |
| 1024   | 16         | 4564            | 5020            | -9.0       |
| 1024   | 256        | 486             | 475             | 2.3        |
| 1024   | 1024       | 221             | 148             | 49.3       |
| 2048   | 16         | 8945            | 9851            | -9.1       |
| 2048   | 256        | 982             | 951             | 3.3        |
| 2048   | 1024       | 500             | 369             | 35.5       |
| 2048   | 2048       | 413             | 265             | 55.8       |
| 4096   | 16         | 17885           | 19351           | -7.5       |
| 4096   | 256        | 1828            | 1713            | 6.7        |
| 4096   | 1024       | 968             | 805             | 20.0       |
| 4096   | 4096       | 739             | 475             | 55.6       |
| 8192   | 16         | 48339           | 41556           | 16.3       |
| 8192   | 256        | 3494            | 3342            | 4.5        |
| 8192   | 1024       | 1959            | 1462            | 34.0       |
| 8192   | 4096       | 1561            | 1036            | 50.7       |
| 8192   | 8192       | 1540            | 1004            | 53.4       |
 ---------------------------------------------------------------------------

This work was inspired by the CRC T10 DIF AVX512 optimization published in Intel Intelligent Storage Acceleration Library.
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm

Co-developed-by: Greg Tucker
Signed-off-by: Greg Tucker
Co-developed-by: Tomasz Kantecki
Signed-off-by: Tomasz Kantecki
Signed-off-by: Kyung Min Park
Signed-off-by: Megha Dey
---
 arch/x86/crypto/Makefile                  |   1 +
 arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++
 arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
 arch/x86/include/asm/disabled-features.h  |   8 +-
 crypto/Kconfig                            |  23 ++
 5 files changed, 535 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S

...

diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5a..26a6350 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -35,6 +35,16 @@
 #include

 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
+#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
+asmlinkage u
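The run-time selection described above would sit in the glue code's update path, roughly as sketched here. crc_t10dif_pcl() and the chksum_desc_ctx layout follow the existing crct10dif-pclmul_glue.c quoted in the diff; the crc_t10dif_vpclmul() entry point name is hypothetical:

	#include <linux/linkage.h>
	#include <linux/crc-t10dif.h>
	#include <crypto/internal/hash.h>
	#include <crypto/internal/simd.h>
	#include <asm/fpu/api.h>

	static bool use_avx512;	/* module parameter, as in the patch */

	asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
	asmlinkage u16 crc_t10dif_vpclmul(u16 init_crc, const u8 *buf, size_t len); /* hypothetical */

	struct chksum_desc_ctx {
		__u16 crc;
	};

	static int chksum_update(struct shash_desc *desc, const u8 *data,
				 unsigned int length)
	{
		struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);

		if (length >= 16 && crypto_simd_usable()) {
			kernel_fpu_begin();
			if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) && use_avx512)
				ctx->crc = crc_t10dif_vpclmul(ctx->crc, data, length);
			else
				ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
			kernel_fpu_end();
		} else {
			ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
		}
		return 0;
	}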
Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
Hi Ard,

On 1/16/2021 8:52 AM, Ard Biesheuvel wrote:

On Mon, 28 Dec 2020 at 20:11, Dey, Megha wrote:

Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:

On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:

Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions (first implemented on Intel's Icelake client and Xeon CPUs). These algorithms take advantage of the AVX512 registers to keep the CPU busy and increase memory bandwidth utilization. They provide substantial (2-10x) improvements over existing crypto algorithms when the update data size is greater than 128 bytes and do not have any significant impact when used on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause collateral damage to other workloads running on the same core (co-scheduled threads). These frequency drops are also known as bin drops, where 1 bin drop is around 100 MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin drop (0-100 MHz) is observed on the Icelake desktop and 0-2 bin drops (0-200 MHz) are observed on the Icelake server.

Do these new algorithms all pass the self-tests, including the fuzz tests that are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad). I see a couple of errors after enabling it and am working on fixing those.

Hello Megha,

I think the GHASH changes can be dropped (as discussed in the other thread), given the lack of a use case. The existing GHASH driver could also be removed in the future, but I don't think it needs to be part of this series.

Ok, I will remove the GHASH patch from the next series.

Could you please rebase this onto the latest AES-NI changes that are in Herbert's tree? (as well as the ones I sent out today) They address some issues with indirect calls and excessive disabling of preemption, and your GCM and CTR changes are definitely going to be affected by this as well.

Yeah sure, will do, thanks for the heads-up!
Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
On 1/15/2021 5:43 PM, Eric Biggers wrote:

On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used in combination with AES-CTR to produce GCM, and this series already contains a GCM driver. Do cores exist that implement PCLMULQDQ but not AES-NI? If not, I think we should be able to drop this patch (and remove the existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (authenticated but not encrypted file system) is one use case for authentication only, although I am not sure if GHASH is specifically used for this or SHA. Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

dm-verity only uses unkeyed hash algorithms. So no, it doesn't use GHASH.

Hmm, I see. If that is the case, I am not aware of any other use case apart from GCM. I see that the existing GHASH module has been in the kernel since 2009. I am not sure if there was a use case then which is no longer valid now. There may be out-of-tree kernel modules which use it, but again that's only speculation.

So, in the next version should I remove the existing GHASH module? (And of course remove this patch as well?)

-Megha

- Eric
Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey wrote:

From: Kyung Min Park

Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instructions. The new instruction allows working on 4 x 16 byte blocks at a time. For best parallelism and deeper out-of-order execution, the main loop of the code works on 16 x 16 byte blocks at a time and performs a reduction every 48 x 16 byte blocks. Such an approach needs 48 precomputed GHASH subkeys, and the precompute operation has been optimized as well to leverage 512-bit registers and parallel carry-less multiply and reduction.

The VPCLMULQDQ instruction is used to accelerate the most time-consuming part of GHASH, carry-less multiplication. The VPCLMULQDQ instruction with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.

The glue code in the ghash_clmulni_intel module overrides the existing PCLMULQDQ version with the VPCLMULQDQ version when the following criteria are met:

At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions

At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently only Icelake)
2. If compiled as a built-in module, ghash_clmulni_intel.use_avx512 is set at boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set to 1 after boot. If compiled as a loadable module, the use_avx512 module parameter must be set: modprobe ghash_clmulni_intel use_avx512=1

With the new implementation, the tcrypt GHASH speed test shows about a 4x to 10x speedup for GHASH calculation compared to the original implementation with PCLMULQDQ when the bytes-per-update size is 256 bytes or above. Detailed results for a variety of block sizes and update sizes are in the table below. The test was performed on an Icelake based platform with constant frequency set for the CPU.

The average performance improvement of the AVX512 version over the current implementation is as follows:
For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).

A typical run of tcrypt with GHASH calculation with the PCLMULQDQ instruction and the VPCLMULQDQ instruction shows the following results.

 ---------------------------------------------------------------------------
|        |            |        cycles/operation           |            |
|        |            |      (the lower the better)       |            |
| byte   | bytes per  |-----------------------------------| percentage |
| blocks | update     | GHASH test      | GHASH test      | loss/gain  |
|        |            | with PCLMULQDQ  | with VPCLMULQDQ |            |
|--------|------------|-----------------|-----------------|------------|
| 16     | 16         | 144             | 233             | -38.0      |
| 64     | 16         | 535             | 709             | -24.5      |
| 64     | 64         | 210             | 146             | 43.8       |
| 256    | 16         | 1808            | 1911            | -5.4       |
| 256    | 64         | 865             | 581             | 48.9       |
| 256    | 256        | 682             | 170             | 301.0      |
| 1024   | 16         | 6746            | 6935            | -2.7       |
| 1024   | 256        | 2829            | 714             | 296.0      |
| 1024   | 1024       | 2543            | 341             | 645.0      |
| 2048   | 16         | 13219           | 13403           | -1.3       |
| 2048   | 256        | 5435            | 1408            | 286.0      |
| 2048   | 1024       | 5218            | 685             | 661.0      |
| 2048   | 2048       | 5061            | 565             | 796.0      |
| 4096   | 16         | 40793           | 27615           | 47.8       |
| 4096   | 256        | 10662           | 2689            | 297.0      |
| 4096   | 1024       | 10196           | 1333            | 665.0      |
| 4096   | 4096       | 10049           | 1011            | 894.0      |
| 8192   | 16         | 51672           | 54599           | -5.3       |
| 8192   | 256        | 21228           | 5284            | 301.0      |
| 8192   | 1024       | 20306           | 2556            | 694.0      |
| 8192   | 4096       | 20076           | 2044            | 882.0      |
| 8192   | 8192       | 20071           | 2017            | 895.0      |
 ---------------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Greg Tucker
Signed-off-by: Greg Tucker
Co-developed-by: Tomasz Kantecki
Signed-off-by: Tomasz Kantecki
Signed-off-by: Kyung Min Park
Co-developed-by: Megha Dey
Signed-off-by: Megha Dey

Hello Megha,

What is the purpose of this s
Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:

On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:

Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions (first implemented on Intel's Icelake client and Xeon CPUs). These algorithms take advantage of the AVX512 registers to keep the CPU busy and increase memory bandwidth utilization. They provide substantial (2-10x) improvements over existing crypto algorithms when the update data size is greater than 128 bytes and do not have any significant impact when used on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause collateral damage to other workloads running on the same core (co-scheduled threads). These frequency drops are also known as bin drops, where 1 bin drop is around 100 MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin drop (0-100 MHz) is observed on the Icelake desktop and 0-2 bin drops (0-200 MHz) are observed on the Icelake server.

Do these new algorithms all pass the self-tests, including the fuzz tests that are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad). I see a couple of errors after enabling it and am working on fixing those.

Megha

- Eric
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Dey, Megha >Sent: Friday, May 11, 2018 6:22 PM >To: Herbert Xu >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > > > >>-Original Message- >>From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >>Sent: Thursday, May 10, 2018 9:46 PM >>To: Dey, Megha >>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >>da...@davemloft.net >>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption >>infrastructure support >> >>On Fri, May 11, 2018 at 01:24:42AM +, Dey, Megha wrote: >>> >>> Are you suggesting that the SIMD wrapper, will do what is currently >>> being >>done by the ' mcryptd_queue_worker ' function (assuming FPU is not >>disabled) i.e dispatching the job to the inner algorithm? >>> >>> I have got rid of the mcryptd layer( have an inner layer, outer SIMD >>> layer, >>handled the pointers and completions accordingly), but still facing >>some issues after removing the per cpu mcryptd_cpu_queue. >> >>Why don't you post what you've got and we can work it out together? > >Hi Herbert, > >Sure, I will post an RFC patch. (crypto: Remove mcryptd). Hi Herbert, I had posted an RFC patch about a month back (2018/05/11). Could you please have a look? [RFC] crypto: Remove mcryptd > >> >>Thanks, >>-- >>Email: Herbert Xu Home Page: >>http://gondor.apana.org.au/~herbert/ >>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Thursday, May 10, 2018 9:46 PM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Fri, May 11, 2018 at 01:24:42AM +, Dey, Megha wrote: >> >> Are you suggesting that the SIMD wrapper, will do what is currently being >done by the ' mcryptd_queue_worker ' function (assuming FPU is not disabled) >i.e dispatching the job to the inner algorithm? >> >> I have got rid of the mcryptd layer( have an inner layer, outer SIMD layer, >handled the pointers and completions accordingly), but still facing some issues >after removing the per cpu mcryptd_cpu_queue. > >Why don't you post what you've got and we can work it out together? Hi Herbert, Sure, I will post an RFC patch. (crypto: Remove mcryptd). > >Thanks, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Monday, May 7, 2018 2:35 AM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Tue, May 01, 2018 at 10:39:15PM +, Dey, Megha wrote: >> >> crypto/simd.c provides a simd_skcipher_create_compat. I have used the >> same template to introduce simd_ahash_create_compat which would wrap >around the inner hash algorithm. >> >> Hence we would still register 2 algs, outer and inner. > >Right. > >> Currently we have outer_alg -> mcryptd alg -> inner_alg >> >> Mcryptd is mainly providing the following: >> 1. Ensuring the lanes(8 in case of AVX2) are full before dispatching to the >lower inner algorithm. This is obviously why we would expect better >performance for multi-buffer as opposed to the present single-buffer >algorithms. >> 2. If there no new incoming jobs, issue a flush. >> 3. A glue layer which sends the correct pointers and completions. >> >> If we get rid of mcryptd, these functions needs to be done by someone. Since >all multi-buffer algorithms would require this tasks, where do you suggest >these >helpers live, if not the current mcryptd.c? > >That's the issue. I don't think mcryptd is doing any of these claimed >functions >except for hosting the flush helper which could really live anywhere. > >All these functions are actually being carried out in the inner algorithm >already. > >> I am not sure if you are suggesting that we need to get rid of the mcryptd >work queue itself. In that case, we would need to execute in the context of the >job requesting the crypto transformation. > >Which is fine as long as you can disable the FPU. If not the simd wrapper will >defer the job to kthread context as required. Hi Herbert, Are you suggesting that the SIMD wrapper, will do what is currently being done by the ' mcryptd_queue_worker ' function (assuming FPU is not disabled) i.e dispatching the job to the inner algorithm? I have got rid of the mcryptd layer( have an inner layer, outer SIMD layer, handled the pointers and completions accordingly), but still facing some issues after removing the per cpu mcryptd_cpu_queue. > >Cheers, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
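For reference, the deferral Herbert describes is what the existing simd wrapper in crypto/simd.c already does for skciphers, roughly as in the sketch below (paraphrased from memory; an ahash wrapper would mirror it). When the FPU cannot be used in the current context, the request is handed to the cryptd async path instead of being dispatched to the inner algorithm:

	#include <linux/preempt.h>
	#include <crypto/cryptd.h>
	#include <crypto/internal/skcipher.h>
	#include <asm/simd.h>

	struct simd_skcipher_ctx {
		struct cryptd_skcipher *cryptd_tfm;
	};

	static int simd_skcipher_encrypt(struct skcipher_request *req)
	{
		struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
		struct simd_skcipher_ctx *ctx = crypto_skcipher_ctx(tfm);
		struct skcipher_request *subreq = skcipher_request_ctx(req);
		struct crypto_skcipher *child;

		*subreq = *req;

		/* Defer to cryptd (kthread context) when SIMD is not usable here. */
		if (!may_use_simd() ||
		    (in_atomic() && cryptd_skcipher_queued(ctx->cryptd_tfm)))
			child = &ctx->cryptd_tfm->base;
		else
			child = cryptd_skcipher_child(ctx->cryptd_tfm);

		skcipher_request_set_tfm(subreq, child);
		return crypto_skcipher_encrypt(subreq);
	}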
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Thursday, April 26, 2018 2:45 AM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Wed, Apr 25, 2018 at 01:14:26AM +, Dey, Megha wrote: >> >> Is there any existing implementation of async crypto algorithm that uses the >above approach? The ones I could find are either sync, have an outer and >inner algorithm or use cryptd. >> >> I tried removing the mcryptd layer and the outer algorithm and some >> plumbing to pass the correct structures, but see crashes.(obviously >> some errors in the plumbing) > >OK, you can't just remove it because the inner algorithm requires >kernel_fpu_begin/kernel_fpu_end. So we do need two layers but I don't think >we need cryptd or mcryptd. > >The existing simd wrapper should work just fine on the inner algorithm, >provided that we add hash support to it. Hi Herbert, crypto/simd.c provides a simd_skcipher_create_compat. I have used the same template to introduce simd_ahash_create_compat which would wrap around the inner hash algorithm. Hence we would still register 2 algs, outer and inner. > >> I am not sure if we remove mcryptd, how would we queue work, flush >partially completed jobs or call completions (currently done by mcryptd) if we >simply call the inner algorithm. > >I don't think mcryptd is providing any real facility to the flushing apart >from a >helper. That same helper can live anywhere. Currently we have outer_alg -> mcryptd alg -> inner_alg Mcryptd is mainly providing the following: 1. Ensuring the lanes(8 in case of AVX2) are full before dispatching to the lower inner algorithm. This is obviously why we would expect better performance for multi-buffer as opposed to the present single-buffer algorithms. 2. If there no new incoming jobs, issue a flush. 3. A glue layer which sends the correct pointers and completions. If we get rid of mcryptd, these functions needs to be done by someone. Since all multi-buffer algorithms would require this tasks, where do you suggest these helpers live, if not the current mcryptd.c? I am not sure if you are suggesting that we need to get rid of the mcryptd work queue itself. In that case, we would need to execute in the context of the job requesting the crypto transformation. > >Cheers, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
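A rough sketch of the two-level registration Megha describes: register the inner "__"-prefixed multi-buffer ahash, then wrap it with the proposed simd_ahash_create_compat() helper modeled on simd_skcipher_create_compat(). The helper, the simd_ahash_alg type and sha1_mb_inner_alg are placeholders for the proposed code, not an existing API:

	static struct simd_ahash_alg *sha1_mb_simd;	/* hypothetical type */

	static int __init sha1_mb_mod_init(void)
	{
		int err;

		/* inner algorithm, e.g. driver name "__intel_sha1-mb" */
		err = crypto_register_ahash(&sha1_mb_inner_alg);
		if (err)
			return err;

		/* outer SIMD wrapper: (algname, drvname, basename) */
		sha1_mb_simd = simd_ahash_create_compat("sha1", "sha1_mb",
							"__intel_sha1-mb");
		if (IS_ERR(sha1_mb_simd)) {
			crypto_unregister_ahash(&sha1_mb_inner_alg);
			return PTR_ERR(sha1_mb_simd);
		}
		return 0;
	}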
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Wednesday, April 18, 2018 8:25 PM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Thu, Apr 19, 2018 at 12:54:16AM +, Dey, Megha wrote: >> >> Yeah I think I misunderstood. I think what you mean is to remove mcryptd.c >completely and avoid the extra layer of indirection to call the underlying >algorithm, instead call it directly, correct? >> >> So currently we have 3 algorithms registered for every multibuffer algorithm: >> name : __sha1-mb >> driver : mcryptd(__intel_sha1-mb) >> >> name : sha1 >> driver : sha1_mb >> >> name : __sha1-mb >> driver : __intel_sha1-mb >> >> If we remove mcryptd, then we will have just the 2? > >It should be down to just one, i.e., the current inner algorithm. >It's doing all the scheduling work already so I don't really see why it needs >the >wrappers around it. Hi Herbert, Is there any existing implementation of async crypto algorithm that uses the above approach? The ones I could find are either sync, have an outer and inner algorithm or use cryptd. I tried removing the mcryptd layer and the outer algorithm and some plumbing to pass the correct structures, but see crashes.(obviously some errors in the plumbing) I am not sure if we remove mcryptd, how would we queue work, flush partially completed jobs or call completions (currently done by mcryptd) if we simply call the inner algorithm. Are you suggesting these are not required at all? I am not sure how to move forward. > >Cheers, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Wednesday, April 18, 2018 4:01 AM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Tue, Apr 17, 2018 at 06:40:17PM +, Dey, Megha wrote: >> >> >> >-Original Message- >> >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >> >Sent: Friday, March 16, 2018 7:54 AM >> >To: Dey, Megha >> >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >> >da...@davemloft.net >> >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption >> >infrastructure support >> > >> >I have taken a deeper look and I'm even more convinced now that >> >mcryptd is simply not needed in your current model. >> > >> >The only reason you would need mcryptd is if you need to limit the >> >rate of requests going into the underlying mb algorithm. >> > >> >However, it doesn't do that all. Even though it seems to have a >> >batch size of 10, but because it immediately reschedules itself after >> >the batch runs out, it's essentially just dumping all requests at the >> >underlying algorithm as fast as they're coming in. The underlying >> >algorithm doesn't have need throttling anyway because it'll do the work >when the queue is full synchronously. >> > >> >So why not just get rid of mcryptd completely and expose the >> >underlying algorithm as a proper async skcipher/hash? >> >> Hi Herbert, >> >> Most part of the cryptd.c and mcryptd.c are similar, except the logic >> used to flush out partially completed jobs in the case of multibuffer >algorithms. >> >> I think I will try to merge the cryptd and mcryptd adding necessary quirks >> for >multibuffer where needed. > >I think you didn't quite get my point. From what I'm seeing you don't need >either cryptd or mcryptd. You just need to expose the underlying mb >algorithm directly. Hi Herbert, Yeah I think I misunderstood. I think what you mean is to remove mcryptd.c completely and avoid the extra layer of indirection to call the underlying algorithm, instead call it directly, correct? So currently we have 3 algorithms registered for every multibuffer algorithm: name : __sha1-mb driver : mcryptd(__intel_sha1-mb) name : sha1 driver : sha1_mb name : __sha1-mb driver : __intel_sha1-mb If we remove mcryptd, then we will have just the 2? The outer algorithm:sha1-mb, will > >So I'm not sure what we would gain from merging cryptd and mcryptd. > >Cheers, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support
>-Original Message- >From: Herbert Xu [mailto:herb...@gondor.apana.org.au] >Sent: Friday, March 16, 2018 7:54 AM >To: Dey, Megha >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org; >da...@davemloft.net >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure >support > >On Thu, Jan 18, 2018 at 04:44:21PM -0800, Megha Dey wrote: >> >> > So the mcryptd template is in fact completely superfluous. You can >> > remove it and just have all the main encrypt/decrypt functions >> > invoke the underlying encrypt/decrypt function directly and achieve >> > the same result. >> > >> > Am I missing something? >> >> Hi Herbert, >> >> After discussing with Tim, it seems like the mcryptd is responsible >> for queuing up the encrypt requests and dispatching them to the actual >> multi-buffer raw algorithm. It also flushes the queue if we wait too >> long without new requests coming in to force dispatch of the requests >> in queue. >> >> Its function is analogous to cryptd but it has its own multi-lane >> twists so we haven't reused the cryptd interface. > >I have taken a deeper look and I'm even more convinced now that mcryptd is >simply not needed in your current model. > >The only reason you would need mcryptd is if you need to limit the rate of >requests going into the underlying mb algorithm. > >However, it doesn't do that all. Even though it seems to have a batch size of >10, but because it immediately reschedules itself after the batch runs out, >it's essentially just dumping all requests at the underlying algorithm as fast >as they're coming in. The underlying algorithm doesn't have need throttling >anyway because it'll do the work when the queue is full synchronously. > >So why not just get rid of mcryptd completely and expose the underlying >algorithm as a proper async skcipher/hash? Hi Herbert, Most part of the cryptd.c and mcryptd.c are similar, except the logic used to flush out partially completed jobs in the case of multibuffer algorithms. I think I will try to merge the cryptd and mcryptd adding necessary quirks for multibuffer where needed. Also, in cryptd.c, I see shash interface being used for hash digests, update, finup, setkey etc. whereas we have shifted to ahash interface for mcryptd. Is this correct? Thanks, Megha > >Thanks, >-- >Email: Herbert Xu Home Page: >http://gondor.apana.org.au/~herbert/ >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: sha1_mb broken
-----Original Message-----
From: Stephan Mueller [mailto:smuel...@chronox.de]
Sent: Wednesday, September 28, 2016 10:31 PM
To: Dey, Megha
Cc: linux-crypto@vger.kernel.org; tim.c.c...@linux.intel.com
Subject: Re: sha1_mb broken

On Wednesday, 28 September 2016, 22:52:46 CEST, Dey, Megha wrote:

Hi Megha,

see a self contained example code attached.

> Hi Stephan,
>
> Your test code initialized the completion structure incorrectly, that led to
> the missing completion from being received. The init_completion call should
> be made before the crypto_ahash_digest call. The following change to your
> test code fixes things. (I have also fixed what I believe is a typo:
> aead->ahash)
>
> @@ -74,6 +74,8 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def *ahash)
>  {
>	int rc = 0;
>
> +	init_completion(&ahash->result.completion);
> +
>	rc = crypto_ahash_digest(ahash->req);
>
>	switch (rc) {
> @@ -84,7 +86,7 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def *ahash)
>		rc = wait_for_completion_interruptible(&ahash->result.completion);
>		if (!rc && !ahash->result.err) {
>  #ifdef OLDASYNC
> -			INIT_COMPLETION(aead->result.completion);
> +			INIT_COMPLETION(&ahash->result.completion);
>  #else
>			reinit_completion(&ahash->result.completion);
>  #endif
> @@ -95,7 +97,6 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def *ahash)
>			" %d\n", rc, ahash->result.err);
>			break;
>	}
> -	init_completion(&ahash->result.completion);
>
>	return rc;
>  }
>
> This initialization of the completion structure happens correctly in the
> tcrypt test module used by the kernel, hence I did not come across this issue
> earlier.
>
> Thanks,
> Megha

Ciao
Stephan
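The fix above boils down to "the completion must be initialized before the request that may complete it is issued". In later kernels this wait-for-async-result boilerplate is wrapped by the crypto_wait_req()/DECLARE_CRYPTO_WAIT() helpers, which avoid the ordering pitfall entirely; a sketch of an equivalent digest call using them (these helper names are from the current API, not from the 2016 code being discussed):

	#include <linux/crypto.h>
	#include <crypto/hash.h>

	static int do_one_ahash_digest(struct ahash_request *req)
	{
		DECLARE_CRYPTO_WAIT(wait);	/* initializes the completion up front */

		ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
					   crypto_req_done, &wait);

		/* Handles 0 / -EINPROGRESS / -EBUSY uniformly and waits as needed. */
		return crypto_wait_req(crypto_ahash_digest(req), &wait);
	}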
RE: sha1_mb broken
-----Original Message-----
From: Stephan Mueller [mailto:smuel...@chronox.de]
Sent: Wednesday, September 28, 2016 11:46 AM
To: Dey, Megha
Cc: linux-crypto@vger.kernel.org; tim.c.c...@linux.intel.com
Subject: Re: sha1_mb broken

On Wednesday, 28 September 2016, 11:25:47 CEST, Megha Dey wrote:

Hi Megha,

> Hi Stephan,
>
> There was a bug fix: Commit ID : 0851561d (introduced in 4.6-rc5).

I use the current cryptodev-2.6 tree.

> Assuming that you are using an older kernel than this one, maybe we
> are issuing the complete with the wrong pointer, so the original
> issuer of the request never gets the complete back.
>
> If you are using an older kernel, can you please use the latest and
> let me know if you still see this issue?
>
> Also can you give more info on the test case? Does it issue a single
> request or multiple requests?

It is a single test with the message "8c899bba" where I expect "ac6d8c4851beacf61c175aed0699053d8f632df8".

>> For the message 8c899bba, the expected hash is not
>> ac6d8c4851beacf61c175aed0699053d8f632df8 but 2e2816f6241fef89d78dd7afc926adc03ca94ace.

I used this hash value in the existing tcrypt test and the test passes. I would like to duplicate the test case you are using so that I can have a better look. Can you provide the complete code for the test case? The code you sent earlier is missing some structure definitions. I will try to incorporate this test in the tcrypt test suite and try to reproduce this issue.

The code is the following which works with any other hash:

struct kccavs_ahash_def {
	struct crypto_ahash *tfm;
	struct ahash_request *req;
	struct kccavs_tcrypt_res result;
};

/* Callback function */
static void kccavs_ahash_cb(struct crypto_async_request *req, int error)
{
	struct kccavs_tcrypt_res *result = req->data;

	if (error == -EINPROGRESS)
		return;
	result->err = error;
	complete(&result->completion);
	dbg("Encryption finished successfully\n");
}

/* Perform hash */
static unsigned int kccavs_ahash_op(struct kccavs_ahash_def *ahash)
{
	int rc = 0;

	rc = crypto_ahash_digest(ahash->req);

	switch (rc) {
	case 0:
		break;
	case -EINPROGRESS:
	case -EBUSY:
		rc = wait_for_completion_interruptible(&ahash->result.completion);
		if (!rc && !ahash->result.err) {
#ifdef OLDASYNC
			INIT_COMPLETION(aead->result.completion);
#else
			reinit_completion(&ahash->result.completion);
#endif
			break;
		}
	default:
		dbg("ahash cipher operation returned with %d result %d\n",
		    rc, ahash->result.err);
		break;
	}
	init_completion(&ahash->result.completion);

	return rc;
}

static int kccavs_test_ahash(size_t nbytes)
{
	int ret;
	struct crypto_ahash *tfm;
	struct ahash_request *req = NULL;
	struct kccavs_ahash_def ahash;
	struct kccavs_data *data = &kccavs_test->data;
	struct kccavs_data *key = &kccavs_test->key;
	unsigned char *digest = NULL;
	struct scatterlist sg;

	/*
	 * We explicitly do not check the input buffer as we allow
	 * an empty string.
	 */

	/* allocate synchronous hash */
	tfm = crypto_alloc_ahash(kccavs_test->name, 0, 0);
	if (IS_ERR(tfm)) {
		pr_info("could not allocate digest TFM handle for %s\n",
			kccavs_test->name);
		return PTR_ERR(tfm);
	}

	digest = kzalloc(crypto_ahash_digestsize(tfm), GFP_KERNEL);
	if (!digest) {
		ret = -ENOMEM;
		goto out;
	}

	req = ahash_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		pr_info("could not allocate request queue\n");
		ret = -ENOMEM;
		goto out;
	}

	ahash.tfm = tfm;
	ahash.req = req;

	if (key->len) {
		dbg("set key for HMAC\n");
		ret = crypto_ahash_setkey(tfm, key->data, key->len);
		if (ret < 0)
			goto out;
	}

	sg_init_one(&sg, data->data, data->len);
	ahash_request_set_crypt(req, &sg, digest, data->len);
	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   kccavs_ahash_cb, &ahash.result);

	ret = kccavs_ahash_op(&ahash);
	if (!ret) {
		data->len = crypto_ahash_digestsize(tfm);
		memcpy(data->data, digest, data->len);
	}

out:
	kzfree(digest);
	if (req)
RE: [PATCH] crypto : async implementation for sha1-mb
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Monday, June 13, 2016 1:22 AM To: Dey, Megha Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua ; Megha Dey Subject: Re: [PATCH] crypto : async implementation for sha1-mb On Tue, Jun 07, 2016 at 11:14:42AM -0700, Megha Dey wrote: > > - desc->tfm = child; > - desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP; > + ahash_request_set_tfm(desc, child); > + ahash_request_set_callback(desc, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, > +NULL); Why are the callbacks set to NULL/NULL? The child is async so you should have a valid callback function here. Instead of continuing to do the broken callback handling outside of the API (i.e., rctx->complete) please use the API mechanism that is provided for this purpose. > In the current implementation, the inner algorithm is called directly, and we > use the outer algorithm's callback. We do not use the callback in inner > algorithm. We are actually calling the child functions directly and the child > is using the parent's call back function. Probably I can add a comment before > the set callback function.. will this be ok? Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 2/2] crypto : async implementation for sha1-mb
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Tuesday, June 7, 2016 3:35 AM To: Dey, Megha Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua ; Megha Dey Subject: Re: [PATCH V2 2/2] crypto : async implementation for sha1-mb On Thu, Jun 02, 2016 at 07:53:50PM -0700, Megha Dey wrote: > > + struct ahash_alg *shash = crypto_ahash_alg(tfm); > > /* alignment is to be done by multi-buffer crypto algorithm if > needed */ > > - return shash->finup(desc, NULL, 0, req->result); > + return shash->finup(desc); You're still poking in the guts of the API. Now that it's a real ahash you don't need to do that. Just do crypto_ahash_finup. That way you don't need to export crypto_ahsh_alg either. > I have made these changes and re-sent the patch. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2] crypto : async implementation for sha1-mb
-Original Message- From: Herbert Xu [mailto:herb...@gondor.apana.org.au] Sent: Thursday, June 2, 2016 5:33 PM To: Dey, Megha Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua Subject: Re: [PATCH 2/2] crypto : async implementation for sha1-mb On Thu, Jun 02, 2016 at 10:20:20AM -0700, Megha Dey wrote: > > > > @@ -439,17 +444,18 @@ static int mcryptd_hash_finup_enqueue(struct > > > ahash_request *req) static void mcryptd_hash_digest(struct > > > crypto_async_request *req_async, int err) { > > > struct mcryptd_hash_ctx *ctx = crypto_tfm_ctx(req_async->tfm); > > > - struct crypto_shash *child = ctx->child; > > > + struct crypto_ahash *child = ctx->child; > > > struct ahash_request *req = ahash_request_cast(req_async); > > > struct mcryptd_hash_request_ctx *rctx = ahash_request_ctx(req); > > > - struct shash_desc *desc = &rctx->desc; > > > + struct ahash_request *desc = &rctx->areq; > > > + struct crypto_async_request *base = &desc->base; > > > > > > if (unlikely(err == -EINPROGRESS)) > > > goto out; > > > + base->tfm = &child->base; > > > + base->flags = CRYPTO_TFM_REQ_MAY_SLEEP; /* check this again */ > > > > You should not be touching crypto_async_request directly. Use the > > proper ahash interface to set the child request. > > > Herbert, Could you please clarify? > In the earlier code we had a async_request which is now replaced by > crypto_async_request. Do you want a new async_request to be used? > Do you think we shouldn't be setting the members of the > crypto_ahash_request directly, but use some other interface to do the > same for us? You already have an ahash_request here. So you should be doing ahash_request_set_tfm(...) ahash_request_set_callback(...) >ok,done! Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
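Putting Herbert's suggestion into one place, the mcryptd dispatch path would set up the child request through the ahash API rather than poking at the crypto_async_request directly. A minimal sketch, using the rctx/child names from the diff quoted above; the callback and its data pointer are illustrative, not the final code:

	#include <crypto/internal/hash.h>
	#include <crypto/mcryptd.h>

	/* Prepare the inner (child) request via the ahash API instead of
	 * writing desc->base.tfm and desc->base.flags by hand. */
	static void mcryptd_prepare_child_req(struct mcryptd_hash_request_ctx *rctx,
					      struct crypto_ahash *child,
					      struct ahash_request *req)
	{
		ahash_request_set_tfm(&rctx->areq, child);
		ahash_request_set_callback(&rctx->areq, CRYPTO_TFM_REQ_MAY_SLEEP,
					   rctx->complete, &req->base);
	}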