Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

2021-02-23 Thread Dey, Megha

Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:

On Fri, Jan 22, 2021 at 11:29 PM Megha Dey  wrote:

Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core (co-scheduled
threads). These frequency drops are also known as bin drops, where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin
drop (0-100MHz) is observed on the Icelake desktop and 0-2 bin drops (0-200MHz)
are observed on the Icelake server.

The AVX512 optimizations are disabled by default to avoid impact on other
workloads. In order to use these optimized algorithms:
1. At compile time:
a. User must enable the CONFIG_CRYPTO_AVX512 option
b. Toolchain (assembler) must support VPCLMULQDQ and VAES instructions
2. At run time:
a. User must set the module parameter use_avx512 at boot time
b. Platform must support VPCLMULQDQ and VAES features

N.B. It is unclear whether these coarse-grained controls (global module
parameter) would meet all user needs. Perhaps some per-thread control might
be useful? Looking for guidance here.
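
To make the gating concrete, here is a minimal sketch (not taken from the
patches) of the combined check implied by criteria 1-2 above; use_avx512
stands for the boot-time module parameter, while IS_ENABLED(), boot_cpu_has()
and the X86_FEATURE_* flags are existing kernel helpers:

#include <linux/kconfig.h>
#include <asm/cpufeature.h>

static bool use_avx512;	/* set via module_param() at boot, as described above */

static bool crypto_avx512_usable(void)
{
	/* 1a/1b are compile-time: the Kconfig option (which in turn needs
	 * assembler support); 2a/2b are run-time: the module parameter plus
	 * the VAES/VPCLMULQDQ CPU features. */
	return IS_ENABLED(CONFIG_CRYPTO_AVX512) && use_avx512 &&
	       boot_cpu_has(X86_FEATURE_VAES) &&
	       boot_cpu_has(X86_FEATURE_VPCLMULQDQ);
}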


I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in?  I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being “in use”?  As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants.  Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER?  The docs suggest no, since Broadwell and
before don’t support EVEX, but I’d like to know for sure.


My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support.  Based on
some reading, that seems to mean Ice Lake Client and not anything
before it.  I also think a bunch of the above questions should be
answered before we do any of this.  Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.


We are currently working on providing answers to the questions you have
raised regarding AVX.


Thanks,

Megha



Re: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

2021-01-20 Thread Dey, Megha

Hi Ard,

On 1/16/2021 9:16 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:08, Megha Dey  wrote:

Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
and decode routines using VPCLMULQDQ.

The glue code in the AESNI module overrides the existing AVX2 GCM mode
encryption/decryption routines with the AVX512 AES GCM mode ones when the
following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on the platform
(currently only Icelake)
2. aesni_intel.use_avx512 module parameter is set at boot time. For this
algorithm, switching away from the AVX512 optimized version is not possible
once set at boot time because of how the code is structured today. (This can
be changed later if required.)

The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
from the Intel Optimized IPSEC Cryptographic library.

On an Icelake desktop, with turbo disabled and all CPUs running at
maximum frequency, the AVX512 GCM mode optimization shows better
performance across data & key sizes as measured by tcrypt.

The average performance improvement of the AVX512 version over the AVX2
version is as follows:
For all key sizes (128/192/256 bits),
 data sizes < 128 bytes/block, negligible improvement (~7.5%)
 data sizes > 128 bytes/block, there is an average improvement of
 40% for both encryption and decryption.

A typical run of tcrypt with AES GCM mode encryption/decryption of the
AVX2 and AVX512 optimizations on an Icelake desktop shows the following
results:

   ----------------------------------------------------------------------
   |   key  | bytes |  cycles/op (lower is better)  | Percentage gain/  |
   | length |  per  |   encryption  |  decryption   |       loss        |
   | (bits) | block |------|--------|------|--------|---------|---------|
   |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
   |--------------------------------------------------------------------|
   |  128   | 16| 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
   |  128   | 64| 731  |  660   | 771  |  649   |   9.7   |  15.82  |
   |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
   |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
   |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
   |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
   |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
   |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
   |  192   | 16| 754  |  683   | 737  |  672   |   9.42  |   8.82  |
   |  192   | 64| 735  |  686   | 715  |  640   |   6.66  |  10.49  |
   |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70 |
   |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
   |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
   |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
   |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
   |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
   |  256   | 16| 755  |  682   | 720  |  683   |   9.68  |   5.14  |
   |  256   | 64| 744  |  677   | 719  |  658   |   9 |   8.48  |
   |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21 |
   |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
   |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
   |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
   |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
   |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
   ----------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published in
Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki 
Signed-off-by: Tomasz Kantecki 
Signed-off-by: Megha Dey 
---
  arch/x86/crypto/Makefile|1 +
  arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++
  arch/x86/crypto/aesni-intel_glue.c  |   62 +-
  crypto/Kconfig  |   12 +
  4 files changed, 1858 insertions(+), 5 deletions(-)
  create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S


...

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 9e56cdf..8fc5bac 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
   * This needs to be 16 b

Re: [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support

2021-01-20 Thread Dey, Megha

Hi Ard,

On 1/16/2021 8:54 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey  wrote:

This is a preparatory patch to introduce the optimized crypto algorithms
using AVX512 instructions, which require VAES and VPCLMULQDQ support.

Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.

Cc: x...@kernel.org
Signed-off-by: Megha Dey 
---
  arch/x86/Kconfig.assembler | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08..9ea0bc8 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -1,6 +1,16 @@
  # SPDX-License-Identifier: GPL-2.0
  # Copyright (C) 2020 Jason A. Donenfeld . All Rights 
Reserved.

+config AS_VAES_AVX512
+   def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT

Is the '&& 64BIT' necessary here, but not below?

In any case, better to use a separate 'depends on' line, for legibility


Yeah, I think the '&& 64BIT' is not required. I will remove it in the
next version.


-Megha




+   help
+ Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_VPCLMULQDQ
+   def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
+   help
+ Supported by binutils >= 2.30 and LLVM integrated assembler
+
  config AS_AVX512
 def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
 help
--
2.7.4



Re: [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization

2021-01-20 Thread Dey, Megha

Hi Ard,

On 1/16/2021 9:03 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:08, Megha Dey  wrote:

Introduce the "by16" implementation of the AES CTR mode using AVX512
optimizations. "by16" means that 16 independent blocks (each block
being 128 bits) can be ciphered simultaneously as opposed to the
current 8 blocks.

The glue code in AESNI module overrides the existing "by8" CTR mode
encryption/decryption routines with the "by16" ones when the following
criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VAES instructions
At runtime:
1. VAES and AVX512VL features are supported on the platform (currently
only Icelake)
2. aesni_intel.use_avx512 module parameter is set at boot time. For this
algorithm, switching away from the AVX512 optimized version is not possible once
set at boot time because of how the code is structured today. (This can be
changed later if required.)

The functions aes_ctr_enc_128_avx512_by16(), aes_ctr_enc_192_avx512_by16()
and aes_ctr_enc_256_avx512_by16() are adapted from Intel Optimized IPSEC
Cryptographic library.
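
As a point of reference, a hedged sketch of what the prototypes for the by16
helpers named above could look like, assuming they keep the argument layout of
the existing by8 routines visible later in this patch
(aes_ctr_enc_192_avx_by8() in aesni-intel_glue.c); the actual declarations
live in the patch itself:

asmlinkage void aes_ctr_enc_128_avx512_by16(const u8 *in, u8 *iv, void *keys,
					    u8 *out, unsigned int num_bytes);
asmlinkage void aes_ctr_enc_192_avx512_by16(const u8 *in, u8 *iv, void *keys,
					    u8 *out, unsigned int num_bytes);
asmlinkage void aes_ctr_enc_256_avx512_by16(const u8 *in, u8 *iv, void *keys,
					    u8 *out, unsigned int num_bytes);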

On an Icelake desktop, with turbo disabled and all CPUs running at maximum
frequency, the "by16" CTR mode optimization shows better performance
across data & key sizes as measured by tcrypt.

The average performance improvement of the "by16" version over the "by8"
version is as follows:
For all key sizes (128/192/256 bits),
 data sizes < 128 bytes/block, negligible impact (~3% loss)
 data sizes > 128 bytes/block, there is an average improvement of
48% for both encryption and decryption.

A typical run of tcrypt with AES CTR mode encryption/decryption of the
"by8" and "by16" optimization on a Icelake desktop shows the following
results:

------------------------------------------------------------------
|  key   | bytes | cycles/op (lower is better) |  percentage  |
| length |  per  |  encryption  |  decryption  |  loss/gain   |
| (bits) | block |------|-------|------|-------|------|-------|
|        |       | by8  | by16  | by8  | by16  |  enc | dec   |
|----------------------------------------------------------------|
|  128   |  16   | 156  | 168   | 164  | 168  | -7.7 |  -2.5 |
|  128   |  64   | 180  | 190   | 157  | 146  | -5.6 |   7.1 |
|  128   |  256  | 248  | 158   | 251  | 161  | 36.3 |  35.9 |
|  128   |  1024 | 633  | 316   | 642  | 319  | 50.1 |  50.4 |
|  128   |  1472 | 853  | 411   | 877  | 407  | 51.9 |  53.6 |
|  128   |  8192 | 4463 | 1959  | 4447 | 1940 | 56.2 |  56.4 |
|  192   |  16   | 136  | 145   | 149  | 166  | -6.7 | -11.5 |
|  192   |  64   | 159  | 154   | 157  | 160  |  3.2 |  -2   |
|  192   |  256  | 268  | 172   | 274  | 177  | 35.9 |  35.5 |
|  192   |  1024 | 710  | 358   | 720  | 355  | 49.6 |  50.7 |
|  192   |  1472 | 989  | 468   | 983  | 469  | 52.7 |  52.3 |
|  192   |  8192 | 6326 | 3551  | 6301 | 3567 | 43.9 |  43.4 |
|  256   |  16   | 153  | 165   | 139  | 156  | -7.9 | -12.3 |
|  256   |  64   | 158  | 152   | 174  | 161  |  3.8 |   7.5 |
|  256   |  256  | 283  | 176   | 287  | 202  | 37.9 |  29.7 |
|  256   |  1024 | 797  | 393   | 807  | 395  | 50.7 |  51.1 |
|  256   |  1472 | 1108 | 534   | 1107 | 527  | 51.9 |  52.4 |
|  256   |  8192 | 5763 | 2616  | 5773 | 2617 | 54.7 |  54.7 |
------------------------------------------------------------------

This work was inspired by the AES CTR mode optimization published
in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/cntr_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki 
Signed-off-by: Tomasz Kantecki 
Signed-off-by: Megha Dey 
---
  arch/x86/crypto/Makefile|   1 +
  arch/x86/crypto/aes_ctrby16_avx512-x86_64.S | 856 
  arch/x86/crypto/aesni-intel_glue.c  |  57 +-
  arch/x86/crypto/avx512_vaes_common.S| 422 ++
  arch/x86/include/asm/disabled-features.h|   8 +-
  crypto/Kconfig  |  12 +
  6 files changed, 1354 insertions(+), 2 deletions(-)
  create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S


...

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ad8a718..f45059e 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -46,6 +46,10 @@
  #define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
  #define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)

+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
+
  /* This data is stored at the end of the crypto_tfm struct.
   * It's a type of per "session" data storage location.
   * This needs to be 16 byte aligned.
@@ -191,6 +195,35 @@ asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
				void *keys, u8 *out, unsigned int num_bytes);
  asmlinkage void aes_ctr_enc_256_avx

Re: [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction

2021-01-20 Thread Dey, Megha

Hi Ard,

On 1/16/2021 9:00 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey  wrote:

From: Kyung Min Park 

Update the crc_pcl function that calculates T10 Data Integrity Field
CRC16 (CRC T10 DIF) using the VPCLMULQDQ instruction. VPCLMULQDQ with
AVX-512F adds an EVEX encoded 512 bit version of the PCLMULQDQ instruction.
The advantage comes from packing multiples of 4 x 128 bits of data into
AVX512 registers, reducing instruction latency.

The glue code in the crct10dif module overrides the existing PCLMULQDQ version
with the VPCLMULQDQ version when the following criteria are met (a sketch of
this dispatch follows the criteria below):
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
only Icelake)
2. If compiled as built-in module, crct10dif_pclmul.use_avx512 is set at
boot time or /sys/module/crct10dif_pclmul/parameters/use_avx512 is set
to 1 after boot.
If compiled as loadable module, use_avx512 module parameter must be set:
modprobe crct10dif_pclmul use_avx512=1
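
To illustrate the dispatch referred to above, a hedged sketch follows.
crc_t10dif_pcl() is the existing PCLMULQDQ entry point declared in
crct10dif-pclmul_glue.c; crc_t10dif_avx512() is only a placeholder name for
the new VPCLMULQDQ routine, and a real implementation would cache the
CPU-feature decision rather than re-checking it on every call:

#include <linux/types.h>
#include <linux/linkage.h>
#include <asm/cpufeature.h>
#include <asm/fpu/api.h>

static bool use_avx512;	/* the module parameter described above */

asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
asmlinkage u16 crc_t10dif_avx512(u16 init_crc, const u8 *buf, size_t len); /* placeholder */

static u16 crc_t10dif_arch(u16 crc, const u8 *buf, size_t len)
{
	u16 res;

	kernel_fpu_begin();
	if (use_avx512 && boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
	    boot_cpu_has(X86_FEATURE_AVX512VL))
		res = crc_t10dif_avx512(crc, buf, len);
	else
		res = crc_t10dif_pcl(crc, buf, len);
	kernel_fpu_end();

	return res;
}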

A typical run of tcrypt with CRC T10 DIF calculation with PCLMULQDQ
instruction and VPCLMULQDQ instruction shows the following results:
For bytes per update >= 1KB, we see an average improvement of 46% (~1.4x).
For bytes per update < 1KB, we see an average improvement of 13%.
The test was performed on an Icelake-based platform with a constant frequency
set for the CPU.

Detailed results for a variety of block sizes and update sizes are in
the table below.

---------------------------------------------------------------------------
|            |            |         cycles/operation          |            |
|            |            |      (the lower the better)       |            |
|    byte    |   bytes    |-----------------------------------| percentage |
|   blocks   | per update |   CRC T10 DIF  |   CRC T10 DIF    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ  |            |
|------------|------------|----------------|------------------|------------|
|  16| 16 |77  |106  |   -27.0|
|  64| 16 |   411  |390  | 5.4|
|  64| 64 |71  | 85  |   -16.0|
| 256| 16 |  1224  |   1308  |-6.4|
| 256| 64 |   393  |407  |-3.4|
| 256|256 |93  | 86  | 8.1|
|1024| 16 |  4564  |   5020  |-9.0|
|1024|256 |   486  |475  | 2.3|
|1024|   1024 |   221  |148  |49.3|
|2048| 16 |  8945  |   9851  |-9.1|
|2048|256 |   982  |951  | 3.3|
|2048|   1024 |   500  |369  |35.5|
|2048|   2048 |   413  |265  |55.8|
|4096| 16 | 17885  |  19351  |-7.5|
|4096|256 |  1828  |   1713  | 6.7|
|4096|   1024 |   968  |805  |20.0|
|4096|   4096 |   739  |475  |55.6|
|8192| 16 | 48339  |  41556  |16.3|
|8192|256 |  3494  |   3342  | 4.5|
|8192|   1024 |  1959  |   1462  |34.0|
|8192|   4096 |  1561  |   1036  |50.7|
|8192|   8192 |  1540  |   1004  |53.4|
---------------------------------------------------------------------------

This work was inspired by the CRC T10 DIF AVX512 optimization published
in Intel Intelligent Storage Acceleration Library.
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm

Co-developed-by: Greg Tucker 
Signed-off-by: Greg Tucker 
Co-developed-by: Tomasz Kantecki 
Signed-off-by: Tomasz Kantecki 
Signed-off-by: Kyung Min Park 
Signed-off-by: Megha Dey 
---
  arch/x86/crypto/Makefile  |   1 +
  arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++
  arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
  arch/x86/include/asm/disabled-features.h  |   8 +-
  crypto/Kconfig|  23 ++
  5 files changed, 535 insertions(+), 3 deletions(-)
  create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S


...

diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5a..26a6350 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -35,6 +35,16 @@
  #include 

  asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
+#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
+asmlinkage u

Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms

2021-01-16 Thread Dey, Megha

Hi Ard,

On 1/16/2021 8:52 AM, Ard Biesheuvel wrote:

On Mon, 28 Dec 2020 at 20:11, Dey, Megha  wrote:

Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:

On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:

Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core(co-scheduled
threads). These frequency drops are also known as bin drops where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
are observed on the Icelake server.


Do these new algorithms all pass the self-tests, including the fuzz tests that
are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
tcrypt, not with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad).
I see a couple of errors after enabling it and am working on fixing those.


Hello Megha,

I think the GHASH changes can be dropped (as discussed in the other
thread), given the lack of a use case. The existing GHASH driver could
also be removed in the future, but I don't think it needs to be part
of this series.

Ok, I will remove the GHASH patch from the next series.


Could you please rebase this onto the latest AES-NI changes that are
in Herbert's tree? (as well as the ones I sent out today) They address
some issues with indirect calls and excessive disabling of preemption,
and your GCM and CTR changes are definitely going to be affected by
this as well.

Yeah sure, will do, thanks for the headsup!


Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

2021-01-15 Thread Dey, Megha



On 1/15/2021 5:43 PM, Eric Biggers wrote:

On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (an authenticated but not encrypted file system) is one use
case for authentication only.

Although I am not sure if GHASH specifically is used for this, or SHA.

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.


dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.


Hmm, I see. If that is the case, I am not aware of any other use case 
apart from GCM.


I see that the existing GHASH module has been in the kernel since 2009. I am
not sure if there was a use case then which is no longer valid now.


There may be out-of-tree kernel modules which use it, but again that is only
speculation.


So, in the next version should I remove the existing GHASH module? (And 
of course remove this patch as well?)


-Megha



- Eric


Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

2021-01-15 Thread Dey, Megha

Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:

On Fri, 18 Dec 2020 at 22:07, Megha Dey  wrote:

From: Kyung Min Park 

Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
The new instruction allows working on 4 x 16 byte blocks at a time.
For best parallelism and deeper out of order execution, the main loop of
the code works on 16 x 16 byte blocks at a time and performs a reduction
every 48 x 16 byte blocks. Such an approach needs 48 precomputed GHASH subkeys,
and the precompute operation has been optimized as well to leverage 512 bit
registers, parallel carry-less multiply and reduction.
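
For intuition (this identity is the standard aggregated-reduction trick, not
text from the patch): unrolling the per-block GHASH recurrence
Y_i = (Y_{i-1} xor X_i) * H over 48 blocks gives

    Y_48 = (Y_0 xor X_1)*H^48 xor X_2*H^47 xor ... xor X_47*H^2 xor X_48*H

in GF(2^128). The 48 carry-less products are independent, so they can be
computed four 128-bit blocks at a time in ZMM registers and reduced only once
at the end, which is why the 48 powers H, H^2, ..., H^48 are precomputed.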

VPCLMULQDQ instruction is used to accelerate the most time-consuming
part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.

The glue code in the ghash_clmulni_intel module overrides the existing
PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
only Icelake)
2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
to 1 after boot.
If compiled as loadable module, use_avx512 module parameter must be set:
modprobe ghash_clmulni_intel use_avx512=1

With the new implementation, the tcrypt ghash speed test shows about a 4x to
10x speedup for GHASH calculation compared to the original
implementation with PCLMULQDQ when the bytes per update size is 256 bytes
or above. Detailed results for a variety of block sizes and update
sizes are in the table below. The test was performed on an Icelake-based
platform with a constant frequency set for the CPU.

The average performance improvement of the AVX512 version over the current
implementation is as follows:
For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).

A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
and VPCLMULQDQ instruction shows the following results.

---------------------------------------------------------------------------
|            |            |         cycles/operation          |            |
|            |            |      (the lower the better)       |            |
|    byte    |   bytes    |-----------------------------------| percentage |
|   blocks   | per update |   GHASH test   |   GHASH test     | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ  |            |
|------------|------------|----------------|------------------|------------|
|  16| 16 |   144  |233  |   -38.0|
|  64| 16 |   535  |709  |   -24.5|
|  64| 64 |   210  |146  |43.8|
| 256| 16 |  1808  |   1911  |-5.4|
| 256| 64 |   865  |581  |48.9|
| 256|256 |   682  |170  |   301.0|
|1024| 16 |  6746  |   6935  |-2.7|
|1024|256 |  2829  |714  |   296.0|
|1024|   1024 |  2543  |341  |   645.0|
|2048| 16 | 13219  |  13403  |-1.3|
|2048|256 |  5435  |   1408  |   286.0|
|2048|   1024 |  5218  |685  |   661.0|
|2048|   2048 |  5061  |565  |   796.0|
|4096| 16 | 40793  |  27615  |47.8|
|4096|256 | 10662  |   2689  |   297.0|
|4096|   1024 | 10196  |   1333  |   665.0|
|4096|   4096 | 10049  |   1011  |   894.0|
|8192| 16 | 51672  |  54599  |-5.3|
|8192|256 | 21228  |   5284  |   301.0|
|8192|   1024 | 20306  |   2556  |   694.0|
|8192|   4096 | 20076  |   2044  |   882.0|
|8192|   8192 | 20071  |   2017  |   895.0|
---------------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published
in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Greg Tucker 
Signed-off-by: Greg Tucker 
Co-developed-by: Tomasz Kantecki 
Signed-off-by: Tomasz Kantecki 
Signed-off-by: Kyung Min Park 
Co-developed-by: Megha Dey 
Signed-off-by: Megha Dey 

Hello Megha,

What is the purpose of this s

Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms

2020-12-28 Thread Dey, Megha

Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:

On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:

Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core(co-scheduled
threads). These frequency drops are also known as bin drops where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
are observed on the Icelake server.


Do these new algorithms all pass the self-tests, including the fuzz tests that
are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?


I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and 
tcrypt, not with

CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad).
I see a couple of errors after enabling it and am working on fixing those.

Megha



- Eric


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-06-20 Thread Dey, Megha



>-Original Message-
>From: Dey, Megha
>Sent: Friday, May 11, 2018 6:22 PM
>To: Herbert Xu 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>
>
>>-Original Message-
>>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>>Sent: Thursday, May 10, 2018 9:46 PM
>>To: Dey, Megha 
>>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>>da...@davemloft.net
>>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption
>>infrastructure support
>>
>>On Fri, May 11, 2018 at 01:24:42AM +, Dey, Megha wrote:
>>>
>>> Are you suggesting that the SIMD wrapper, will do what is currently
>>> being
>>done by the ' mcryptd_queue_worker ' function (assuming FPU is not
>>disabled) i.e dispatching the job to the inner algorithm?
>>>
>>> I have got rid of the mcryptd layer( have an inner layer, outer SIMD
>>> layer,
>>handled the pointers and completions accordingly), but still facing
>>some issues after removing the per cpu mcryptd_cpu_queue.
>>
>>Why don't you post what you've got and we can work it out together?
>
>Hi Herbert,
>
>Sure, I will post an RFC patch. (crypto: Remove mcryptd).

Hi Herbert,

I had posted an RFC patch about a month back (2018/05/11). Could you please 
have a look?

[RFC] crypto: Remove mcryptd
>
>>
>>Thanks,
>>--
>>Email: Herbert Xu  Home Page:
>>http://gondor.apana.org.au/~herbert/
>>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-05-11 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Thursday, May 10, 2018 9:46 PM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Fri, May 11, 2018 at 01:24:42AM +, Dey, Megha wrote:
>>
>> Are you suggesting that the SIMD wrapper, will do what is currently being
>done by the ' mcryptd_queue_worker ' function (assuming FPU is not disabled)
>i.e dispatching the job to the inner algorithm?
>>
>> I have got rid of the mcryptd layer( have an inner layer, outer SIMD layer,
>handled the pointers and completions accordingly), but still facing some issues
>after removing the per cpu mcryptd_cpu_queue.
>
>Why don't you post what you've got and we can work it out together?

Hi Herbert,

Sure, I will post an RFC patch. (crypto: Remove mcryptd). 

>
>Thanks,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-05-10 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Monday, May 7, 2018 2:35 AM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Tue, May 01, 2018 at 10:39:15PM +, Dey, Megha wrote:
>>
>> crypto/simd.c provides a simd_skcipher_create_compat. I have used the
>> same template to introduce simd_ahash_create_compat which would wrap
>around the inner hash algorithm.
>>
>> Hence we would still register 2 algs, outer and inner.
>
>Right.
>
>> Currently we have outer_alg -> mcryptd alg -> inner_alg
>>
>> Mcryptd is mainly providing the following:
>> 1. Ensuring the lanes(8 in case of AVX2) are full before dispatching to the
>lower inner algorithm. This is obviously why we would expect better
>performance for multi-buffer as opposed to the present single-buffer
>algorithms.
>> 2. If there no new incoming jobs, issue a flush.
>> 3. A glue layer which sends the correct pointers and completions.
>>
>> If we get rid of mcryptd, these functions needs to be done by someone. Since
>all multi-buffer algorithms would require this tasks, where do you suggest 
>these
>helpers live, if not the current mcryptd.c?
>
>That's the issue.  I don't think mcryptd is doing any of these claimed 
>functions
>except for hosting the flush helper which could really live anywhere.
>
>All these functions are actually being carried out in the inner algorithm 
>already.
>
>> I am not sure if you are suggesting that we need to get rid of the mcryptd
>work queue itself. In that case, we would need to execute in the context of the
>job requesting the crypto transformation.
>
>Which is fine as long as you can disable the FPU.  If not the simd wrapper will
>defer the job to kthread context as required.

Hi Herbert,

Are you suggesting that the SIMD wrapper will do what is currently being done
by the 'mcryptd_queue_worker' function (assuming the FPU is not disabled), i.e.
dispatching the job to the inner algorithm?

I have got rid of the mcryptd layer (I have an inner layer and an outer SIMD
layer, and handled the pointers and completions accordingly), but am still
facing some issues after removing the per-CPU mcryptd_cpu_queue.
 
>
>Cheers,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-05-01 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Thursday, April 26, 2018 2:45 AM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Wed, Apr 25, 2018 at 01:14:26AM +, Dey, Megha wrote:
>>
>> Is there any existing implementation of async crypto algorithm that uses the
>above approach? The ones I could find are either sync, have an outer and
>inner algorithm or use cryptd.
>>
>> I tried removing the mcryptd layer and the outer algorithm and some
>> plumbing to pass the correct structures, but see crashes.(obviously
>> some errors in the plumbing)
>
>OK, you can't just remove it because the inner algorithm requires
>kernel_fpu_begin/kernel_fpu_end.  So we do need two layers but I don't think
>we need cryptd or mcryptd.
>
>The existing simd wrapper should work just fine on the inner algorithm,
>provided that we add hash support to it.

Hi Herbert,

crypto/simd.c provides a simd_skcipher_create_compat. I have used the same 
template to introduce simd_ahash_create_compat
which would wrap around the inner hash algorithm.

Hence we would still register 2 algs, outer and inner.
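
For what it is worth, a rough sketch of the registration flow I have in mind
is below; simd_ahash_create_compat() is the new helper modeled on the existing
simd_skcipher_create_compat(), sha1_mb_inner_alg is an illustrative name, and
this is only the shape of the idea, not the actual patch:

	/* Register the internal mb algorithm ("__intel_sha1-mb"), marked
	 * CRYPTO_ALG_INTERNAL so only the wrapper can reach it. */
	err = crypto_register_ahash(&sha1_mb_inner_alg);
	if (err)
		return err;

	/* Create the outer SIMD wrapper ("sha1"/"sha1_mb") on top of it. */
	simd = simd_ahash_create_compat("sha1", "sha1_mb", "__intel_sha1-mb");
	if (IS_ERR(simd)) {
		crypto_unregister_ahash(&sha1_mb_inner_alg);
		return PTR_ERR(simd);
	}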
>
>> I am not sure if we remove mcryptd, how would we queue work, flush
>partially completed jobs or call completions (currently done by mcryptd) if we
>simply call the inner algorithm.
>
>I don't think mcryptd is providing any real facility to the flushing apart 
>from a
>helper.  That same helper can live anywhere.

Currently we have outer_alg -> mcryptd alg -> inner_alg

Mcryptd is mainly providing the following:
1. Ensuring the lanes (8 in the case of AVX2) are full before dispatching to the
lower inner algorithm. This is obviously why we would expect better performance
for multi-buffer as opposed to the present single-buffer algorithms.
2. If there are no new incoming jobs, issue a flush.
3. A glue layer which sends the correct pointers and completions.

If we get rid of mcryptd, these functions need to be done by someone. Since
all multi-buffer algorithms would require these tasks, where do you suggest
these helpers live, if not the current mcryptd.c?

I am not sure if you are suggesting that we need to get rid of the mcryptd work
queue itself. In that case, we would need to execute in the context of the job
requesting the crypto transformation.
>
>Cheers,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-04-24 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Wednesday, April 18, 2018 8:25 PM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Thu, Apr 19, 2018 at 12:54:16AM +, Dey, Megha wrote:
>>
>> Yeah I think I misunderstood. I think what you mean is to remove mcryptd.c
>completely and avoid the extra layer of indirection to call the underlying
>algorithm, instead call it directly, correct?
>>
>> So currently we have 3 algorithms registered for every multibuffer algorithm:
>> name : __sha1-mb
>> driver   : mcryptd(__intel_sha1-mb)
>>
>> name : sha1
>> driver   : sha1_mb
>>
>> name : __sha1-mb
>> driver   : __intel_sha1-mb
>>
>> If we remove mcryptd, then we will have just the 2?
>
>It should be down to just one, i.e., the current inner algorithm.
>It's doing all the scheduling work already so I don't really see why it needs 
>the
>wrappers around it.

Hi Herbert,

Is there any existing implementation of an async crypto algorithm that uses the
above approach? The ones I could find are either sync, have an outer and inner
algorithm, or use cryptd.

I tried removing the mcryptd layer and the outer algorithm, and did some
plumbing to pass the correct structures, but I see crashes (obviously some
errors in the plumbing).

I am not sure, if we remove mcryptd, how we would queue work, flush partially
completed jobs, or call completions (currently done by mcryptd) if we simply
call the inner algorithm.

Are you suggesting these are not required at all? I am not sure how to move 
forward.

>
>Cheers,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-04-18 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Wednesday, April 18, 2018 4:01 AM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Tue, Apr 17, 2018 at 06:40:17PM +, Dey, Megha wrote:
>>
>>
>> >-Original Message-
>> >From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>> >Sent: Friday, March 16, 2018 7:54 AM
>> >To: Dey, Megha 
>> >Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>> >da...@davemloft.net
>> >Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption
>> >infrastructure support
>> >
>> >I have taken a deeper look and I'm even more convinced now that
>> >mcryptd is simply not needed in your current model.
>> >
>> >The only reason you would need mcryptd is if you need to limit the
>> >rate of requests going into the underlying mb algorithm.
>> >
>> >However, it doesn't do that all.  Even though it seems to have a
>> >batch size of 10, but because it immediately reschedules itself after
>> >the batch runs out, it's essentially just dumping all requests at the
>> >underlying algorithm as fast as they're coming in.  The underlying
>> >algorithm doesn't have need throttling anyway because it'll do the work
>when the queue is full synchronously.
>> >
>> >So why not just get rid of mcryptd completely and expose the
>> >underlying algorithm as a proper async skcipher/hash?
>>
>> Hi Herbert,
>>
>> Most part of the cryptd.c and mcryptd.c are similar, except the logic
>> used to flush out partially completed jobs in the case of multibuffer
>algorithms.
>>
>> I think I will try to merge the cryptd and mcryptd adding necessary quirks 
>> for
>multibuffer where needed.
>
>I think you didn't quite get my point.  From what I'm seeing you don't need
>either cryptd or mcryptd.  You just need to expose the underlying mb
>algorithm directly.

Hi Herbert,

Yeah I think I misunderstood. I think what you mean is to remove mcryptd.c 
completely and avoid the extra layer of indirection to call the underlying 
algorithm, instead call it directly, correct?

So currently we have 3 algorithms registered for every multibuffer algorithm:
name : __sha1-mb
driver   : mcryptd(__intel_sha1-mb)

name : sha1
driver   : sha1_mb

name : __sha1-mb
driver   : __intel_sha1-mb

If we remove mcryptd, then we will have just the 2?

The outer algorithm:sha1-mb, will 
>
>So I'm not sure what we would gain from merging cryptd and mcryptd.
>
>Cheers,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure support

2018-04-17 Thread Dey, Megha


>-Original Message-
>From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
>Sent: Friday, March 16, 2018 7:54 AM
>To: Dey, Megha 
>Cc: linux-ker...@vger.kernel.org; linux-crypto@vger.kernel.org;
>da...@davemloft.net
>Subject: Re: [PATCH V8 1/5] crypto: Multi-buffer encryption infrastructure
>support
>
>On Thu, Jan 18, 2018 at 04:44:21PM -0800, Megha Dey wrote:
>>
>> > So the mcryptd template is in fact completely superfluous.  You can
>> > remove it and just have all the main encrypt/decrypt functions
>> > invoke the underlying encrypt/decrypt function directly and achieve
>> > the same result.
>> >
>> > Am I missing something?
>>
>> Hi Herbert,
>>
>> After discussing with Tim, it seems like the mcryptd is responsible
>> for queuing up the encrypt requests and dispatching them to the actual
>> multi-buffer raw algorithm.  It also flushes the queue if we wait too
>> long without new requests coming in to force dispatch of the requests
>> in queue.
>>
>> Its function is analogous to cryptd but it has its own multi-lane
>> twists so we haven't reused the cryptd interface.
>
>I have taken a deeper look and I'm even more convinced now that mcryptd is
>simply not needed in your current model.
>
>The only reason you would need mcryptd is if you need to limit the rate of
>requests going into the underlying mb algorithm.
>
>However, it doesn't do that all.  Even though it seems to have a batch size of
>10, but because it immediately reschedules itself after the batch runs out,
>it's essentially just dumping all requests at the underlying algorithm as fast
>as they're coming in.  The underlying algorithm doesn't have need throttling
>anyway because it'll do the work when the queue is full synchronously.
>
>So why not just get rid of mcryptd completely and expose the underlying
>algorithm as a proper async skcipher/hash?

Hi Herbert,

Most parts of cryptd.c and mcryptd.c are similar, except for the logic used to
flush out partially completed jobs in the case of multi-buffer algorithms.

I think I will try to merge cryptd and mcryptd, adding the necessary quirks for
multi-buffer where needed.

Also, in cryptd.c, I see the shash interface being used for hash digest, update,
finup, setkey etc., whereas we have shifted to the ahash interface for mcryptd.
Is this correct?

Thanks,
Megha
 
>
>Thanks,
>--
>Email: Herbert Xu  Home Page:
>http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: sha1_mb broken

2016-10-03 Thread Dey, Megha


-Original Message-
From: Stephan Mueller [mailto:smuel...@chronox.de] 
Sent: Wednesday, September 28, 2016 10:31 PM
To: Dey, Megha 
Cc: linux-crypto@vger.kernel.org; tim.c.c...@linux.intel.com
Subject: Re: sha1_mb broken

On Wednesday, 28 September 2016, 22:52:46 CEST, Dey, Megha wrote:

Hi Megha,

see a self contained example code attached.
 
> Hi Stephan,
>
> Your test code initialized the completion structure incorrectly, which is why 
> the completion was never received. The init_completion call should 
> be made before the crypto_ahash_digest call. The following change to your 
> test code fixes things. (I have also fixed what I believe is a typo: 
> aead -> ahash)

> @@ -74,6 +74,8 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def 
> *ahash)
> {
>int rc = 0;
>
> +   init_completion(&ahash->result.completion);
> +
>rc = crypto_ahash_digest(ahash->req);
>
>switch (rc) {
>@@ -84,7 +86,7 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def 
>*ahash)
>rc = 
> wait_for_completion_interruptible(&ahash->result.completion);
>if (!rc && !ahash->result.err) {
> #ifdef OLDASYNC
> -   INIT_COMPLETION(aead->result.completion);
> +   INIT_COMPLETION(ahash->result.completion);
> #else
>   reinit_completion(&ahash->result.completion);
> #endif
> @@ -95,7 +97,6 @@ static unsigned int kccavs_ahash_op(struct kccavs_ahash_def 
> *ahash)
>" %d\n",rc, ahash->result.err);
>break;
>   }
> -   init_completion(&ahash->result.completion);
>
> return rc;
>}
> This initialization of the completion structure happens correctly in the 
> tcrypt test module used by the kernel, hence I did not come across this issue 
> earlier.
> Thanks,
> Megha
Ciao
Stephan


RE: sha1_mb broken

2016-09-28 Thread Dey, Megha


-Original Message-
From: Stephan Mueller [mailto:smuel...@chronox.de] 
Sent: Wednesday, September 28, 2016 11:46 AM
To: Dey, Megha 
Cc: linux-crypto@vger.kernel.org; tim.c.c...@linux.intel.com
Subject: Re: sha1_mb broken

On Wednesday, 28 September 2016, 11:25:47 CEST, Megha Dey wrote:

Hi Megha,

> Hi Stephan,
> 
> There was a bug fix: Commit ID : 0851561d (introduced in 4.6-rc5).

I use the current cryptodev-2.6 tree.
> 
> Assuming that you are using an older kernel than this one, maybe we 
> are issuing the complete with the wrong pointer, so the original 
> issuer of the request never gets the complete back.
> 
> If you are using an older kernel, can you please use the lastest and 
> let me know if you still see this issue?
> 
> Also can you give more info on the test case? Does it issue single 
> request or multiple requests?

It is a single test with the message "8c899bba" where I expect 
"ac6d8c4851beacf61c175aed0699053d8f632df8"

>> For the message 8c899bba, the expected hash is  not 
>> ac6d8c4851beacf61c175aed0699053d8f632df8 but 
2e2816f6241fef89d78dd7afc926adc03ca94ace. I used this hash value in the 
existing tcrypt test and the test passes.

I would like to duplicate the test case you are using so that I can have a 
better look.
Can you provide the complete code for the test case? The code you sent earlier 
is missing some structure definitions. I will try to incorporate this test in 
the tcrypt test suite and try to reproduce this issue.

The code is the following which works with any other hash:

struct kccavs_ahash_def {
struct crypto_ahash *tfm;
struct ahash_request *req;
struct kccavs_tcrypt_res result;
};

/* Callback function */
static void kccavs_ahash_cb(struct crypto_async_request *req, int error)
{
	struct kccavs_tcrypt_res *result = req->data;

	if (error == -EINPROGRESS)
		return;
	result->err = error;
	complete(&result->completion);
	dbg("Encryption finished successfully\n");
}

/* Perform hash */
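/*
 * Note: as the 2016-10-03 reply earlier in this archive points out, the
 * completion in ahash->result has to be initialized with init_completion()
 * before the crypto_ahash_digest() call below; this version only does that at
 * the end of the function, which is why the sha1_mb completion is never seen.
 */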
static unsigned int kccavs_ahash_op(struct kccavs_ahash_def *ahash)
{
	int rc = 0;

	rc = crypto_ahash_digest(ahash->req);

	switch (rc) {
	case 0:
		break;
	case -EINPROGRESS:
	case -EBUSY:
		rc = wait_for_completion_interruptible(&ahash->result.completion);
		if (!rc && !ahash->result.err) {
#ifdef OLDASYNC
			INIT_COMPLETION(aead->result.completion);
#else
			reinit_completion(&ahash->result.completion);
#endif
			break;
		}
	default:
		dbg("ahash cipher operation returned with %d result"
		    " %d\n", rc, ahash->result.err);
		break;
	}
	init_completion(&ahash->result.completion);

	return rc;
}

static int kccavs_test_ahash(size_t nbytes)
{
	int ret;
	struct crypto_ahash *tfm;
	struct ahash_request *req = NULL;
	struct kccavs_ahash_def ahash;
	struct kccavs_data *data = &kccavs_test->data;
	struct kccavs_data *key = &kccavs_test->key;
	unsigned char *digest = NULL;
	struct scatterlist sg;

	/*
	 * We explicitly do not check the input buffer as we allow
	 * an empty string.
	 */

	/* allocate synchronous hash */
	tfm = crypto_alloc_ahash(kccavs_test->name, 0, 0);
	if (IS_ERR(tfm)) {
		pr_info("could not allocate digest TFM handle for %s\n",
			kccavs_test->name);
		return PTR_ERR(tfm);
	}

	digest = kzalloc(crypto_ahash_digestsize(tfm), GFP_KERNEL);
	if (!digest) {
		ret = -ENOMEM;
		goto out;
	}

	req = ahash_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		pr_info("could not allocate request queue\n");
		ret = -ENOMEM;
		goto out;
	}
	ahash.tfm = tfm;
	ahash.req = req;

	if (key->len) {
		dbg("set key for HMAC\n");
		ret = crypto_ahash_setkey(tfm, key->data, key->len);
		if (ret < 0)
			goto out;
	}

	sg_init_one(&sg, data->data, data->len);
	ahash_request_set_crypt(req, &sg, digest, data->len);
	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   kccavs_ahash_cb, &ahash.result);

	ret = kccavs_ahash_op(&ahash);
	if (!ret) {
		data->len = crypto_ahash_digestsize(tfm);
		memcpy(data->data, digest, data->len);
	}

out:
	kzfree(digest);
	if (req)

RE: [PATCH] crypto : async implementation for sha1-mb

2016-06-13 Thread Dey, Megha


-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au] 
Sent: Monday, June 13, 2016 1:22 AM
To: Dey, Megha 
Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; 
linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua 
; Megha Dey 
Subject: Re: [PATCH] crypto : async implementation for sha1-mb

On Tue, Jun 07, 2016 at 11:14:42AM -0700, Megha Dey wrote:
>
> - desc->tfm = child;
> - desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
> + ahash_request_set_tfm(desc, child);
> + ahash_request_set_callback(desc, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, 
> +NULL);

Why are the callbacks set to NULL/NULL? The child is async so you should have a 
valid callback function here.

Instead of continuing to do the broken callback handling outside of the API 
(i.e., rctx->complete) please use the API mechanism that is provided for this 
purpose.

> In the current implementation, the inner algorithm is called directly, and we 
> use the outer algorithm's callback. We do not use the callback in the inner 
> algorithm. We are actually calling the child functions directly, and the child 
> uses the parent's callback function. Probably I can add a comment before 
> the set callback function... will this be OK?
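
(For reference, a minimal sketch of the API-based handling Herbert is asking
for; rctx_complete_cb and rctx are illustrative names, not the actual sha1-mb
structures:

	ahash_request_set_tfm(desc, child);
	ahash_request_set_callback(desc, CRYPTO_TFM_REQ_MAY_SLEEP,
				   rctx_complete_cb, rctx);
)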

Thanks,
--
Email: Herbert Xu  Home Page: 
http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH V2 2/2] crypto : async implementation for sha1-mb

2016-06-07 Thread Dey, Megha


-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au] 
Sent: Tuesday, June 7, 2016 3:35 AM
To: Dey, Megha 
Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; 
linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua 
; Megha Dey 
Subject: Re: [PATCH V2 2/2] crypto : async implementation for sha1-mb

On Thu, Jun 02, 2016 at 07:53:50PM -0700, Megha Dey wrote:
>
> + struct ahash_alg *shash = crypto_ahash_alg(tfm);
>  
>   /* alignment is to be done by multi-buffer crypto algorithm if 
> needed */
>  
> - return shash->finup(desc, NULL, 0, req->result);
> + return shash->finup(desc);

You're still poking in the guts of the API.  Now that it's a real ahash you 
don't need to do that.

Just do crypto_ahash_finup.  That way you don't need to export crypto_ahash_alg 
either.

> I have made these changes and re-sent the patch.

Thanks,
--
Email: Herbert Xu  Home Page: 
http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH 2/2] crypto : async implementation for sha1-mb

2016-06-02 Thread Dey, Megha


-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au] 
Sent: Thursday, June 2, 2016 5:33 PM
To: Dey, Megha 
Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; 
linux-crypto@vger.kernel.org; linux-ker...@vger.kernel.org; Yu, Fenghua 

Subject: Re: [PATCH 2/2] crypto : async implementation for sha1-mb

On Thu, Jun 02, 2016 at 10:20:20AM -0700, Megha Dey wrote:
>
> > > @@ -439,17 +444,18 @@ static int mcryptd_hash_finup_enqueue(struct 
> > > ahash_request *req)  static void mcryptd_hash_digest(struct 
> > > crypto_async_request *req_async, int err)  {
> > >   struct mcryptd_hash_ctx *ctx = crypto_tfm_ctx(req_async->tfm);
> > > - struct crypto_shash *child = ctx->child;
> > > + struct crypto_ahash *child = ctx->child;
> > >   struct ahash_request *req = ahash_request_cast(req_async);
> > >   struct mcryptd_hash_request_ctx *rctx = ahash_request_ctx(req);
> > > - struct shash_desc *desc = &rctx->desc;
> > > + struct ahash_request *desc = &rctx->areq;
> > > + struct crypto_async_request *base = &desc->base;
> > >  
> > >   if (unlikely(err == -EINPROGRESS))
> > >   goto out;
> > > + base->tfm = &child->base;
> > > + base->flags = CRYPTO_TFM_REQ_MAY_SLEEP;  /* check this again */
> > 
> > You should not be touching crypto_async_request directly.  Use the 
> > proper ahash interface to set the child request.
> > 
> Herbert, Could you please clarify?
> In the earlier code we had an async_request, which is now replaced by 
> crypto_async_request. Do you want a new async_request to be used?
> Do you think we shouldn't be setting the members of the 
> crypto_ahash_request directly, but use some other interface to do the 
> same for us?

You already have an ahash_request here.  So you should be doing

ahash_request_set_tfm(...)
ahash_request_set_callback(...)

>ok,done!
Thanks,
--
Email: Herbert Xu  Home Page: 
http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt