Re: [RFC PATCH 0/6] Add bulk skcipher requests to crypto API and dm-crypt
On Tue, Jan 17, 2017 at 12:20:02PM +0100, Ondrej Mosnáček wrote:
> 2017-01-13 15:29 GMT+01:00 Herbert Xu:
> > What if the driver had hardware support for generating these IVs?
> > With your scheme this cannot be supported at all.
>
> That's true... I'm starting to think that this isn't really a good
> idea. I was mainly trying to keep the door open for the random IV
> support and also to keep the multi-key stuff (which was really only
> intended for loop-AES partition support) out of the crypto API, but
> both of these can be probably solved in a better way...

As you said that the multi-key stuff is legacy-only, I too would like
to see a way to keep that complexity out of the common path.

> > With such a definition you could either generate the IVs in dm-crypt
> > or have them generated in the IV generator.
>
> That seems kind of hacky to me... but if that's what you prefer, then
> so be it.

I'm open to other proposals. The basic requirement is to be able to
process multiple blocks as one entity at the driver level, potentially
generating the IVs there too. It's essentially the equivalent of full
IPsec offload.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] random: remove unused branch in hot code path
On Tue, Dec 27, 2016 at 11:40:23PM +0100, Stephan Müller wrote:
> The variable ip is defined to be a __u64 which is always 8 bytes on any
> architecture. Thus, the check for sizeof(ip) > 4 will always be true.
>
> As the check happens in a hot code path, remove the branch.

Since the condition is a compile-time constant, the compiler will
optimize the branch out on its own, so the fact that it's on the hot
code path is irrelevant. The main issue is that on platforms with a
32-bit ip, ip >> 32 will always be zero.

It might be that we can just do this via

#if BITS_PER_LONG == 32
...
#else
...
#endif

I'm not sure that works for all platforms, though. More research is
needed...

- Ted
Re: [PATCH 3/8] random: trigger random_ready callback upon crng_init == 1
On Tue, Dec 27, 2016 at 11:39:57PM +0100, Stephan Müller wrote:
> The random_ready callback mechanism is intended to replicate the
> getrandom system call behavior to in-kernel users. As the getrandom
> system call unblocks with crng_init == 1, trigger the random_ready
> wakeup call at the same time.

It was deliberate that random_ready would only get triggered with
crng_init == 2. In general I'm assuming kernel callers really want real
randomness (as opposed to using prandom), whereas there are a lot of
b.s. userspace users of kernel randomness (for things that really don't
require cryptographic randomness, e.g., for salting Python
dictionaries, systemd/udev using /dev/urandom for non-cryptographic,
non-security applications, etc.)

- Ted
[PATCH 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2)
Patch #1 is a fix for the CBC chaining issue that was discussed on the
mailing list. The driver itself is queued for v4.11, so this fix can go
right on top.

Patches #2 - #6 clear the cra_alignmask of various drivers: all NEON
capable CPUs can perform unaligned accesses, and the advantage of using
the slightly faster aligned accessors (which only exist on ARM, not
arm64) is certainly outweighed by the cost of copying data to suitably
aligned buffers.

NOTE: patch #5 won't apply unless 'crypto: arm64/aes-blk - honour
iv_out requirement in CBC and CTR modes' is applied first, which was
sent out separately as a bugfix for v3.16 - v4.9. If this is a problem,
this patch can wait.

Patches #7 and #8 are minor tweaks to the new scalar AES code.

Patch #9 improves the performance of the plain NEON AES code, to make
it more suitable as a fallback for the new bitsliced NEON code, which
can only operate on 8 blocks in parallel, and needs another driver to
perform CBC encryption or XTS tweak generation.

Patch #10 updates the new bitsliced AES NEON code to switch to the
plain NEON driver as a fallback.
Patches #9 and #10 improve the performance of CBC encryption by ~35% on
low end cores such as the Cortex-A53 that can be found in the
Raspberry Pi 3.

Ard Biesheuvel (10):
  crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
  crypto: arm/aes-ce - remove cra_alignmask
  crypto: arm/chacha20 - remove cra_alignmask
  crypto: arm64/aes-ce-ccm - remove cra_alignmask
  crypto: arm64/aes-blk - remove cra_alignmask
  crypto: arm64/chacha20 - remove cra_alignmask
  crypto: arm64/aes - avoid literals for cross-module symbol references
  crypto: arm64/aes - performance tweak
  crypto: arm64/aes-neon-blk - tweak performance for low end cores
  crypto: arm64/aes - replace scalar fallback with plain NEON fallback

 arch/arm/crypto/aes-ce-core.S          |  84 -
 arch/arm/crypto/aes-ce-glue.c          |  15 +-
 arch/arm/crypto/chacha20-neon-glue.c   |   1 -
 arch/arm64/crypto/Kconfig              |   2 +-
 arch/arm64/crypto/aes-ce-ccm-glue.c    |   1 -
 arch/arm64/crypto/aes-cipher-core.S    |  59 +++---
 arch/arm64/crypto/aes-glue.c           |  18 +-
 arch/arm64/crypto/aes-modes.S          |   8 +-
 arch/arm64/crypto/aes-neon.S           | 199 
 arch/arm64/crypto/aes-neonbs-core.S    |  25 ++-
 arch/arm64/crypto/aes-neonbs-glue.c    |  38 +++-
 arch/arm64/crypto/chacha20-neon-glue.c |   1 -
 12 files changed, 199 insertions(+), 252 deletions(-)

--
2.7.4
[PATCH 03/10] crypto: arm/chacha20 - remove cra_alignmask
Remove the unnecessary alignmask: it is much more efficient to deal
with the misalignment in the core algorithm than relying on the crypto
API to copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/chacha20-neon-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c
index 592f75ae4fa1..59a7be08e80c 100644
--- a/arch/arm/crypto/chacha20-neon-glue.c
+++ b/arch/arm/crypto/chacha20-neon-glue.c
@@ -94,7 +94,6 @@ static struct skcipher_alg alg = {
 	.base.cra_priority	= 300,
 	.base.cra_blocksize	= 1,
 	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
-	.base.cra_alignmask	= 1,
 	.base.cra_module	= THIS_MODULE,

 	.min_keysize		= CHACHA20_KEY_SIZE,
--
2.7.4
[PATCH 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
Update the new bitsliced NEON AES implementation in CTR mode to return
the next IV back to the skcipher API client. This is necessary for
chaining to work correctly. Note that this is only done if the request
is a round multiple of the block size, since otherwise, chaining is
impossible anyway.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-neonbs-core.S | 25 +---
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index 8d0cdaa2768d..2ada12dd768e 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -874,12 +874,19 @@ CPU_LE( rev x8, x8 )
 cselx4, x4, xzr, pl cselx9, x9, xzr, le + tbnzx9, #1, 0f next_ctrv1 + tbnzx9, #2, 0f next_ctrv2 + tbnzx9, #3, 0f next_ctrv3 + tbnzx9, #4, 0f next_ctrv4 + tbnzx9, #5, 0f next_ctrv5 + tbnzx9, #6, 0f next_ctrv6 + tbnzx9, #7, 0f next_ctrv7 0: mov bskey, x2
@@ -928,11 +935,11 @@ CPU_LE( rev x8, x8 )
 eor v5.16b, v5.16b, v15.16b st1 {v5.16b}, [x0], #16 - next_ctrv0 +8: next_ctrv0 cbnzx4, 99b 0: st1 {v0.16b}, [x5] -8: ldp x29, x30, [sp], #16 +9: ldp x29, x30, [sp], #16 ret /*
@@ -941,23 +948,23 @@ CPU_LE( rev x8, x8 )
 */ 1: cbz x6, 8b st1 {v1.16b}, [x5] - b 8b + b 9b 2: cbz x6, 8b st1 {v4.16b}, [x5] - b 8b + b 9b 3: cbz x6, 8b st1 {v6.16b}, [x5] - b 8b + b 9b 4: cbz x6, 8b st1 {v3.16b}, [x5] - b 8b + b 9b 5: cbz x6, 8b st1 {v7.16b}, [x5] - b 8b + b 9b 6: cbz x6, 8b st1 {v2.16b}, [x5] - b 8b + b 9b 7: cbz x6, 8b st1 {v5.16b}, [x5] - b 8b + b 9b ENDPROC(aesbs_ctr_encrypt)
--
2.7.4
[PATCH 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references
Using simple adrp/add pairs to refer to the AES lookup tables exposed
by the generic AES driver (which could be loaded far away from this
driver when KASLR is in effect) was unreliable at module load time
before commit 41c066f2c4d4 ("arm64: assembler: make adr_l work in
modules under KASLR"), which is why the AES code used literals instead.
So now we can get rid of the literals, and switch to the adr_l macro.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-cipher-core.S | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index 37590ab8121a..cd58c61e6677 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -89,8 +89,8 @@ CPU_BE( rev w8, w8 )
 eor w7, w7, w11 eor w8, w8, w12 - ldr tt, =\ttab - ldr lt, =\ltab + adr_l tt, \ttab + adr_l lt, \ltab tbnzrounds, #1, 1f
@@ -111,9 +111,6 @@ CPU_BE( rev w8, w8 )
 stp w5, w6, [out] stp w7, w8, [out, #8] ret - - .align 4 - .ltorg .endm .align 5
--
2.7.4
[PATCH 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback
The new bitsliced NEON implementation of AES uses a fallback in two
places: CBC encryption (which is strictly sequential, whereas this
driver can only operate efficiently on 8 blocks at a time), and the XTS
tweak generation, which involves encrypting a single AES block with a
different key schedule.

The plain (i.e., non-bitsliced) NEON code is more suitable as a
fallback, given that it is faster than scalar on low end cores (which
is what the NEON implementations target, since high end cores have
dedicated instructions for AES), and shows similar behavior in terms
of D-cache footprint and sensitivity to cache timing attacks. So switch
the fallback handling to the plain NEON driver.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/Kconfig           |  2 +-
 arch/arm64/crypto/aes-neonbs-glue.c | 38 ++--
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 5de75c3dcbd4..bed7feddfeed 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -86,7 +86,7 @@ config CRYPTO_AES_ARM64_BS
 	tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
-	select CRYPTO_AES_ARM64
+	select CRYPTO_AES_ARM64_NEON_BLK
 	select CRYPTO_SIMD

 endif
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 323dd76ae5f0..863e436ecf89 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -10,7 +10,6 @@
 #include #include -#include #include #include #include
@@ -42,7 +41,12 @@ asmlinkage void aesbs_xts_encrypt(u8 out[], u8 const in[], u8 const rk[],
 asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds, int blocks, u8 iv[]); -asmlinkage void __aes_arm64_encrypt(u32 *rk, u8 *out, const u8 *in, int rounds); +/* borrowed from aes-neon-blk.ko */ +asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[], +int rounds, int blocks, int first);
+asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[], +int rounds, int blocks, u8 iv[], +int first); struct aesbs_ctx { u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32]; @@ -140,16 +144,28 @@ static int aesbs_cbc_setkey(struct crypto_skcipher *tfm, const u8 *in_key, return 0; } -static void cbc_encrypt_one(struct crypto_skcipher *tfm, const u8 *src, u8 *dst) +static int cbc_encrypt(struct skcipher_request *req) { + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm); + struct skcipher_walk walk; + int err, first = 1; - __aes_arm64_encrypt(ctx->enc, dst, src, ctx->key.rounds); -} + err = skcipher_walk_virt(, req, true); -static int cbc_encrypt(struct skcipher_request *req) -{ - return crypto_cbc_encrypt_walk(req, cbc_encrypt_one); + kernel_neon_begin(); + while (walk.nbytes >= AES_BLOCK_SIZE) { + unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE; + + /* fall back to the non-bitsliced NEON implementation */ + neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr, +ctx->enc, ctx->key.rounds, blocks, walk.iv, +first); + err = skcipher_walk_done(, walk.nbytes % AES_BLOCK_SIZE); + first = 0; + } + kernel_neon_end(); + return err; } static int cbc_decrypt(struct skcipher_request *req) @@ -254,9 +270,11 @@ static int __xts_crypt(struct skcipher_request *req, err = skcipher_walk_virt(, req, true); - __aes_arm64_encrypt(ctx->twkey, walk.iv, walk.iv, ctx->key.rounds); - kernel_neon_begin(); + + neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, +ctx->key.rounds, 1, 1); + while (walk.nbytes >= AES_BLOCK_SIZE) { unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE; -- 2.7.4
[PATCH 05/10] crypto: arm64/aes-blk - remove cra_alignmask
Remove the unnecessary alignmask: it is much more efficient to deal
with the misalignment in the core algorithm than relying on the crypto
API to copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel
---
NOTE: this won't apply unless 'crypto: arm64/aes-blk - honour iv_out
requirement in CBC and CTR modes' is applied first, which was sent out
separately as a bugfix for v3.16 - v4.9

 arch/arm64/crypto/aes-glue.c  | 16 ++--
 arch/arm64/crypto/aes-modes.S |  8 +++-
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 5164aaf82c6a..8ee1fb7aaa4f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -215,14 +215,15 @@ static int ctr_encrypt(struct skcipher_request *req)
 u8 *tsrc = walk.src.virt.addr; /* -* Minimum alignment is 8 bytes, so if nbytes is <= 8, we need -* to tell aes_ctr_encrypt() to only read half a block. +* Tell aes_ctr_encrypt() to process a tail block. */ - blocks = (nbytes <= 8) ?
-1 : 1; + blocks = -1; - aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc, rounds, + aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds, blocks, walk.iv, first); - memcpy(tdst, tail, nbytes); + if (tdst != tsrc) + memcpy(tdst, tsrc, nbytes); + crypto_xor(tdst, tail, nbytes); err = skcipher_walk_done(, 0); } kernel_neon_end(); @@ -282,7 +283,6 @@ static struct skcipher_alg aes_algs[] = { { .cra_flags = CRYPTO_ALG_INTERNAL, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize= sizeof(struct crypto_aes_ctx), - .cra_alignmask = 7, .cra_module = THIS_MODULE, }, .min_keysize= AES_MIN_KEY_SIZE, @@ -298,7 +298,6 @@ static struct skcipher_alg aes_algs[] = { { .cra_flags = CRYPTO_ALG_INTERNAL, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize= sizeof(struct crypto_aes_ctx), - .cra_alignmask = 7, .cra_module = THIS_MODULE, }, .min_keysize= AES_MIN_KEY_SIZE, @@ -315,7 +314,6 @@ static struct skcipher_alg aes_algs[] = { { .cra_flags = CRYPTO_ALG_INTERNAL, .cra_blocksize = 1, .cra_ctxsize= sizeof(struct crypto_aes_ctx), - .cra_alignmask = 7, .cra_module = THIS_MODULE, }, .min_keysize= AES_MIN_KEY_SIZE, @@ -332,7 +330,6 @@ static struct skcipher_alg aes_algs[] = { { .cra_priority = PRIO - 1, .cra_blocksize = 1, .cra_ctxsize= sizeof(struct crypto_aes_ctx), - .cra_alignmask = 7, .cra_module = THIS_MODULE, }, .min_keysize= AES_MIN_KEY_SIZE, @@ -350,7 +347,6 @@ static struct skcipher_alg aes_algs[] = { { .cra_flags = CRYPTO_ALG_INTERNAL, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize= sizeof(struct crypto_aes_xts_ctx), - .cra_alignmask = 7, .cra_module = THIS_MODULE, }, .min_keysize= 2 * AES_MIN_KEY_SIZE, diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S index 838dad5c209f..92b982a8b112 100644 --- a/arch/arm64/crypto/aes-modes.S +++ b/arch/arm64/crypto/aes-modes.S @@ -337,7 +337,7 @@ AES_ENTRY(aes_ctr_encrypt) .Lctrcarrydone: subsw4, w4, #1 - bmi .Lctrhalfblock /* blocks < 0 means 1/2 block */ + bmi .Lctrtailblock /* blocks <0 means tail block */ ld1 {v3.16b}, [x1], 
#16 eor v3.16b, v0.16b, v3.16b st1 {v3.16b}, [x0], #16 @@ -348,10 +348,8 @@ AES_ENTRY(aes_ctr_encrypt) FRAME_POP ret -.Lctrhalfblock: - ld1 {v3.8b}, [x1] - eor v3.8b, v0.8b, v3.8b - st1 {v3.8b}, [x0] +.Lctrtailblock: + st1 {v0.16b}, [x0] FRAME_POP ret -- 2.7.4
[PATCH 08/10] crypto: arm64/aes - performance tweak
Shuffle some instructions around in the __hround macro to shave off
0.1 cycles per byte on Cortex-A57.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-cipher-core.S | 52 +++-
 1 file changed, 19 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index cd58c61e6677..f2f9cc519309 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -20,46 +20,32 @@
 tt .reqx4 lt .reqx2 - .macro __hround, out0, out1, in0, in1, in2, in3, t0, t1, enc - ldp \out0, \out1, [rk], #8 - - ubfxw13, \in0, #0, #8 - ubfxw14, \in1, #8, #8 - ldr w13, [tt, w13, uxtw #2] - ldr w14, [tt, w14, uxtw #2] - + .macro __pair, enc, reg0, reg1, in0, in1e, in1d, shift + ubfx\reg0, \in0, #\shift, #8 .if \enc - ubfxw17, \in1, #0, #8 - ubfxw18, \in2, #8, #8 + ubfx\reg1, \in1e, #\shift, #8 .else - ubfxw17, \in3, #0, #8 - ubfxw18, \in0, #8, #8 + ubfx\reg1, \in1d, #\shift, #8 .endif - ldr w17, [tt, w17, uxtw #2] - ldr w18, [tt, w18, uxtw #2] + ldr \reg0, [tt, \reg0, uxtw #2] + ldr \reg1, [tt, \reg1, uxtw #2] + .endm - ubfxw15, \in2, #16, #8 - ubfxw16, \in3, #24, #8 - ldr w15, [tt, w15, uxtw #2] - ldr w16, [tt, w16, uxtw #2] + .macro __hround, out0, out1, in0, in1, in2, in3, t0, t1, enc + ldp \out0, \out1, [rk], #8 - .if \enc - ubfx\t0, \in3, #16, #8 - ubfx\t1, \in0, #24, #8 - .else - ubfx\t0, \in1, #16, #8 - ubfx\t1, \in2, #24, #8 - .endif - ldr \t0, [tt, \t0, uxtw #2] - ldr \t1, [tt, \t1, uxtw #2] + __pair \enc, w13, w14, \in0, \in1, \in3, 0 + __pair \enc, w15, w16, \in1, \in2, \in0, 8 + __pair \enc, w17, w18, \in2, \in3, \in1, 16 + __pair \enc, \t0, \t1, \in3, \in0, \in2, 24 eor \out0, \out0, w13 - eor \out1, \out1, w17 - eor \out0, \out0, w14, ror #24 - eor \out1, \out1, w18, ror #24 - eor \out0, \out0, w15, ror #16 - eor \out1, \out1, \t0, ror #16 - eor \out0, \out0, w16, ror #8 + eor \out1, \out1, w14 + eor \out0, \out0, w15, ror #24 + eor \out1, \out1, w16, ror #24 + eor \out0, \out0, w17, ror #16 +
eor \out1, \out1, w18, ror #16 + eor \out0, \out0, \t0, ror #8 eor \out1, \out1, \t1, ror #8 .endm -- 2.7.4
[PATCH 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores
The non-bitsliced AES implementation using the NEON is highly sensitive
to micro-architectural details, and, as it turns out, the Cortex-A53 on
the Raspberry Pi 3 is a core that can benefit from this code, given
that its scalar AES performance is abysmal (32.9 cycles per byte). The
new bitsliced AES code manages 19.8 cycles per byte on this core, but
can only operate on 8 blocks at a time, which is not supported by all
chaining modes. With a bit of tweaking, we can get the plain NEON code
to run at 24.0 cycles per byte, making it useful for sequential modes
like CBC encryption. (Like bitsliced NEON, the plain NEON
implementation does not use any lookup tables, which makes it easy on
the D-cache, and invulnerable to cache timing attacks.)

So tweak the plain NEON AES code to use tbl instructions rather than
shl/sri pairs, and to avoid the need to reload permutation vectors or
other constants from memory in every round. To allow the ECB and CBC
encrypt routines to be reused by the bitsliced NEON code in a
subsequent patch, export them from the module.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-glue.c |   2 +
 arch/arm64/crypto/aes-neon.S | 199 
 2 files changed, 77 insertions(+), 124 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 8ee1fb7aaa4f..055bc3f61138 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -409,5 +409,7 @@ static int __init aes_init(void)
 module_cpu_feature_match(AES, aes_init);
 #else
 module_init(aes_init);
+EXPORT_SYMBOL(neon_aes_ecb_encrypt);
+EXPORT_SYMBOL(neon_aes_cbc_encrypt);
 #endif
 module_exit(aes_exit);
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index 85f07ead7c5c..67c68462bc20 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -1,7 +1,7 @@
 /*
  * linux/arch/arm64/crypto/aes-neon.S - AES cipher for ARMv8 NEON
  *
- * Copyright (C) 2013 Linaro Ltd
+ * Copyright (C) 2013 - 2017 Linaro Ltd.
* * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -25,9 +25,9 @@ /* preload the entire Sbox */ .macro prepare, sbox, shiftrows, temp adr \temp, \sbox - moviv12.16b, #0x40 + moviv12.16b, #0x1b ldr q13, \shiftrows - moviv14.16b, #0x1b + ldr q14, .Lror32by8 ld1 {v16.16b-v19.16b}, [\temp], #64 ld1 {v20.16b-v23.16b}, [\temp], #64 ld1 {v24.16b-v27.16b}, [\temp], #64 @@ -50,37 +50,33 @@ /* apply SubBytes transformation using the the preloaded Sbox */ .macro sub_bytes, in - sub v9.16b, \in\().16b, v12.16b + sub v9.16b, \in\().16b, v15.16b tbl \in\().16b, {v16.16b-v19.16b}, \in\().16b - sub v10.16b, v9.16b, v12.16b + sub v10.16b, v9.16b, v15.16b tbx \in\().16b, {v20.16b-v23.16b}, v9.16b - sub v11.16b, v10.16b, v12.16b + sub v11.16b, v10.16b, v15.16b tbx \in\().16b, {v24.16b-v27.16b}, v10.16b tbx \in\().16b, {v28.16b-v31.16b}, v11.16b .endm /* apply MixColumns transformation */ - .macro mix_columns, in - mul_by_xv10.16b, \in\().16b, v9.16b, v14.16b - rev32 v8.8h, \in\().8h - eor \in\().16b, v10.16b, \in\().16b - shl v9.4s, v8.4s, #24 - shl v11.4s, \in\().4s, #24 - sri v9.4s, v8.4s, #8 - sri v11.4s, \in\().4s, #8 - eor v9.16b, v9.16b, v8.16b - eor v10.16b, v10.16b, v9.16b - eor \in\().16b, v10.16b, v11.16b - .endm - + .macro mix_columns, in, enc + .if \enc == 0 /* Inverse MixColumns: pre-multiply by { 5, 0, 4, 0 } */ - .macro inv_mix_columns, in - mul_by_xv11.16b, \in\().16b, v10.16b, v14.16b - mul_by_xv11.16b, v11.16b, v10.16b, v14.16b + mul_by_xv11.16b, \in\().16b, v10.16b, v12.16b + mul_by_xv11.16b, v11.16b, v10.16b, v12.16b eor \in\().16b, \in\().16b, v11.16b rev32 v11.8h, v11.8h eor \in\().16b, \in\().16b, v11.16b - mix_columns \in + .endif + + mul_by_xv10.16b, \in\().16b, v9.16b, v12.16b + rev32 v8.8h, \in\().8h + eor \in\().16b, \in\().16b, v10.16b + eor v10.16b, v10.16b, v8.16b + eor v11.16b, \in\().16b, v8.16b + tbl v11.16b, {v11.16b}, v14.16b +
[PATCH 04/10] crypto: arm64/aes-ce-ccm - remove cra_alignmask
Remove the unnecessary alignmask: it is much more efficient to deal
with the misalignment in the core algorithm than relying on the crypto
API to copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index cc5515dac74a..6a7dbc7c83a6 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -258,7 +258,6 @@ static struct aead_alg ccm_aes_alg = {
 	.cra_priority		= 300,
 	.cra_blocksize		= 1,
 	.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-	.cra_alignmask		= 7,
 	.cra_module		= THIS_MODULE,

 	.ivsize			= AES_BLOCK_SIZE,
--
2.7.4
[PATCH 06/10] crypto: arm64/chacha20 - remove cra_alignmask
Remove the unnecessary alignmask: it is much more efficient to deal
with the misalignment in the core algorithm than relying on the crypto
API to copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/chacha20-neon-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index a7f2337d46cf..a7cd575ea223 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -93,7 +93,6 @@ static struct skcipher_alg alg = {
 	.base.cra_priority	= 300,
 	.base.cra_blocksize	= 1,
 	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
-	.base.cra_alignmask	= 1,
 	.base.cra_module	= THIS_MODULE,

 	.min_keysize		= CHACHA20_KEY_SIZE,
--
2.7.4
[PATCH] crypto: arm64/aes-blk - honour iv_out requirement in CBC and CTR modes
Update the ARMv8 Crypto Extensions and the plain NEON AES
implementations in CBC and CTR modes to return the next IV back to the
skcipher API client. This is necessary for chaining to work correctly.
Note that for CTR, this is only done if the request is a round multiple
of the block size, since otherwise, chaining is impossible anyway.

Cc: # v3.16+
Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/aes-modes.S | 88 ++--
 1 file changed, 42 insertions(+), 46 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index c53dbeae79f2..838dad5c209f 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -193,15 +193,16 @@ AES_ENTRY(aes_cbc_encrypt)
 cbz w6, .Lcbcencloop ld1 {v0.16b}, [x5] /* get iv */ - enc_prepare w3, x2, x5 + enc_prepare w3, x2, x6 .Lcbcencloop: ld1 {v1.16b}, [x1], #16 /* get next pt block */ eor v0.16b, v0.16b, v1.16b /* ..and xor with iv */ - encrypt_block v0, w3, x2, x5, w6 + encrypt_block v0, w3, x2, x6, w7 st1 {v0.16b}, [x0], #16 subsw4, w4, #1 bne .Lcbcencloop + st1 {v0.16b}, [x5] /* return iv */ ret AES_ENDPROC(aes_cbc_encrypt)
@@ -211,7 +212,7 @@ AES_ENTRY(aes_cbc_decrypt)
 cbz w6, .LcbcdecloopNx ld1 {v7.16b}, [x5] /* get iv */ - dec_prepare w3, x2, x5 + dec_prepare w3, x2, x6 .LcbcdecloopNx: #if INTERLEAVE >= 2
@@ -248,7 +249,7 @@ AES_ENTRY(aes_cbc_decrypt)
 .Lcbcdecloop: ld1 {v1.16b}, [x1], #16 /* get next ct block */ mov v0.16b, v1.16b /* ...and copy to v0 */ - decrypt_block v0, w3, x2, x5, w6 + decrypt_block v0, w3, x2, x6, w7 eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */ mov v7.16b, v1.16b /* ct is next iv */ st1 {v0.16b}, [x0], #16
@@ -256,6 +257,7 @@ AES_ENTRY(aes_cbc_decrypt)
 bne .Lcbcdecloop .Lcbcdecout: FRAME_POP + st1 {v7.16b}, [x5] /* return iv */ ret AES_ENDPROC(aes_cbc_decrypt)
@@ -267,24 +269,15 @@ AES_ENDPROC(aes_cbc_decrypt)
 AES_ENTRY(aes_ctr_encrypt) FRAME_PUSH - cbnzw6, .Lctrfirst /* 1st time around?
*/ - umovx5, v4.d[1] /* keep swabbed ctr in reg */ - rev x5, x5 -#if INTERLEAVE >= 2 - cmn w5, w4 /* 32 bit overflow? */ - bcs .Lctrinc - add x5, x5, #1 /* increment BE ctr */ - b .LctrincNx -#else - b .Lctrinc -#endif -.Lctrfirst: + cbz w6, .Lctrnotfirst /* 1st time around? */ enc_prepare w3, x2, x6 ld1 {v4.16b}, [x5] - umovx5, v4.d[1] /* keep swabbed ctr in reg */ - rev x5, x5 + +.Lctrnotfirst: + umovx8, v4.d[1] /* keep swabbed ctr in reg */ + rev x8, x8 #if INTERLEAVE >= 2 - cmn w5, w4 /* 32 bit overflow? */ + cmn w8, w4 /* 32 bit overflow? */ bcs .Lctrloop .LctrloopNx: subsw4, w4, #INTERLEAVE @@ -292,11 +285,11 @@ AES_ENTRY(aes_ctr_encrypt) #if INTERLEAVE == 2 mov v0.8b, v4.8b mov v1.8b, v4.8b - rev x7, x5 - add x5, x5, #1 + rev x7, x8 + add x8, x8, #1 ins v0.d[1], x7 - rev x7, x5 - add x5, x5, #1 + rev x7, x8 + add x8, x8, #1 ins v1.d[1], x7 ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */ do_encrypt_block2x @@ -305,7 +298,7 @@ AES_ENTRY(aes_ctr_encrypt) st1 {v0.16b-v1.16b}, [x0], #32 #else ldr q8, =0x300020001/* addends 1,2,3[,0] */ - dup v7.4s, w5 + dup v7.4s, w8 mov v0.16b, v4.16b add v7.4s, v7.4s, v8.4s mov v1.16b, v4.16b @@ -323,18 +316,12 @@ AES_ENTRY(aes_ctr_encrypt) eor v2.16b, v7.16b, v2.16b eor v3.16b, v5.16b, v3.16b st1 {v0.16b-v3.16b}, [x0], #64 - add x5, x5, #INTERLEAVE + add x8, x8, #INTERLEAVE #endif - cbz w4, .LctroutNx -.LctrincNx: - rev
Re: [RFC PATCH 0/6] Add bulk skcipher requests to crypto API and dm-crypt
2017-01-13 15:29 GMT+01:00 Herbert Xu:
> What if the driver had hardware support for generating these IVs?
> With your scheme this cannot be supported at all.

That's true... I'm starting to think that this isn't really a good
idea. I was mainly trying to keep the door open for the random IV
support and also to keep the multi-key stuff (which was really only
intended for loop-AES partition support) out of the crypto API, but
both of these can probably be solved in a better way...

> Getting the IVs back is not actually that hard. We could simply
> change the algorithm definition for the IV generator so that
> the IVs are embedded in the plaintext and ciphertext. For
> example, you could declare it so that for n sectors the
> first n*ivsize bytes would be the IV, and the actual plaintext
> or ciphertext would follow.
>
> With such a definition you could either generate the IVs in dm-crypt
> or have them generated in the IV generator.

That seems kind of hacky to me... but if that's what you prefer, then
so be it.

Cheers,
Ondrej

> Cheers,
> --
> Email: Herbert Xu
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC PATCH 6/6] dm-crypt: Add bulk crypto processing support
Hi Binoy,

2017-01-16 9:37 GMT+01:00 Binoy Jayan:
> The initial goal of our proposal was to process the encryption
> requests with the maximum possible block sizes with a hardware which
> has automated iv generation capabilities. But when it is done in
> software, and if the bulk requests are processed sequentially, one
> block at a time, the memory footprint could be reduced even if the
> bulk request exceeds a page. While your patch looks good, there are a
> couple of drawbacks, one of which is that the maximum size of a bulk
> request is a page. This could limit the capability of the crypto
> hardware. If the whole bio is processed at once, which is what
> Qualcomm's version of dm-req-crypt does, it achieves an even better
> performance.

I see... well, I added the limit only so that the async fallback
implementation can allocate multiple requests, so they can be processed
in parallel, as they would be in the current dm-crypt code. I'm not
really sure if that brings any benefit, but I guess if some HW
accelerator has multiple engines, then this allows distributing the
work among them. (I wonder how switching to the crypto API's IV
generation will affect the situation for drivers that can process
requests in parallel, but do not support the IV generators...)

I could remove the limit and switch the fallback to sequential
processing (or maybe even allocate the requests from a mempool, the way
dm-crypt does it now...), but after Herbert's feedback I'm probably
going to scrap this patchset anyway...

>> Note that if the 'keycount' parameter of the cipher specification is
>> set to a value other than 1, dm-crypt still sends only one sector in
>> each request, since in such case the neighboring sectors are
>> encrypted with different keys.
>
> This could be avoided if the key management is done at the crypto
> layer.
Yes, but remember that the only reasonable use-case for keycount != 1 is
mounting loop-AES partitions (which is kind of a legacy format, so there is
not much point in making HW drivers for it). It is an unfortunate consequence
of Milan's decision to make keycount an independent part of the cipher
specification (instead of making it specific to the LMK mode) that all the
other IV modes are now 'polluted' with the requirement to support it.

I discussed with Milan the possibility of deprecating the keycount parameter
(i.e. allowing only a value of 64 for LMK and 1 for all the other IV modes)
and then converting the IV modes to skciphers (or IV generators, or some
combination of both). This would significantly simplify the key management and
allow for better optimization strategies. However, I don't know if such a
change would be accepted by the device-mapper maintainers, since it may break
someone's unusual dm-crypt configuration...

Cheers,
Ondrej
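[For reference, the per-sector IV generation that a sequential fallback would
perform is trivial for the stateless IV modes; dm-crypt's "plain64" mode, for
example, just uses the 64-bit little-endian sector number, zero-padded to the
IV size. A userspace sketch, with the splitting helper being an illustrative
name rather than real dm-crypt code:]

```python
import struct

def plain64_iv(sector, iv_size=16):
    # dm-crypt's "plain64" IV: 64-bit sector number, little-endian,
    # zero-padded to the cipher's IV size. No key or state is involved,
    # which is what makes per-sector splitting of a bulk request cheap.
    return struct.pack("<Q", sector).ljust(iv_size, b"\x00")

def split_bulk(data, first_sector, sector_size=512):
    # Sequential-fallback sketch (hypothetical helper): yield one
    # (iv, sector_data) pair per sector of the bulk request.
    return [(plain64_iv(first_sector + i), data[off:off + sector_size])
            for i, off in enumerate(range(0, len(data), sector_size))]
```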
Re: [PATCH] crypto: generic/cts - fix regression in iv handling
On Tue, Jan 17, 2017 at 09:20:11AM +, Ard Biesheuvel wrote:
>
> So to be clear, it is part of the API that after calling
> crypto_skcipher_encrypt(req), and completing the request, req->iv
> should contain a value that could potentially be used to encrypt
> additional data? That sounds highly specific to CBC (e.g., this could
> never work with XTS, since the tweak generation is only performed
> once), so it does not make sense for skciphers in general. For
> instance, drivers for h/w peripherals that never need to map the data
> to begin with (since they only pass the physical addresses to the
> hardware) will need to explicitly map the destination buffer to
> retrieve those bytes, on the off chance that the transform may be
> wrapped by CTS.

Yes, this is part of the API. There was a patch to test this in testmgr, but I
wanted to give the drivers some more time before adding it.

It isn't just CBC that uses chaining. Other modes such as CTR use it too. Disk
encryption in general doesn't use chaining, but that's because it is
sector-oriented.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
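[The chaining contract Herbert describes for CTR can be illustrated in a few
lines of userspace code: after a request completes, the "IV" holds the counter
advanced past the last block, so two chained calls must produce the same
keystream as one. The block cipher below is a toy stand-in, not a real cipher
or the real kernel API:]

```python
import struct

BLOCK = 8

def toy_block(key, block):
    # Toy 8-byte "block cipher" (XOR with key, then byte rotation).
    # For illustration only -- NOT cryptographically secure.
    x = bytes(a ^ b for a, b in zip(block, key))
    return x[1:] + x[:1]

def ctr_crypt(key, iv, data):
    """Returns (output, next_iv). next_iv models what a conforming driver
    leaves in req->iv when the request completes: the counter advanced
    past the last processed block, ready for a chained request."""
    ctr = struct.unpack(">Q", iv)[0]
    out = bytearray()
    for off in range(0, len(data), BLOCK):
        keystream = toy_block(key, struct.pack(">Q", ctr))
        out += bytes(a ^ b for a, b in zip(data[off:off + BLOCK], keystream))
        ctr = (ctr + 1) % (1 << 64)
    return bytes(out), struct.pack(">Q", ctr)
```

[Encrypting a buffer in one call, or in two calls with the returned next_iv
carried into the second, yields identical output; that equivalence is exactly
what a driver breaks if it never writes the updated counter back.]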
Re: [PATCH] crypto: generic/cts - fix regression in iv handling
On Tue, Jan 17, 2017 at 09:30:30AM +, Ard Biesheuvel wrote:
>
> Got a link?

http://lkml.iu.edu/hypermail/linux/kernel/1506.2/00346.html

> OK, so that means chaining skcipher_request_set_crypt() calls, where
> req->iv is passed on between requests? Are there chaining modes beyond
> cts(cbc) encryption that rely on this?

I think algif_skcipher relies on this too.

> It is easily fixed in the chaining mode code, so I'm perfectly happy
> to fix it there instead, but I'd like to understand the requirements
> exactly before doing so.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: generic/cts - fix regression in iv handling
On Mon, Jan 16, 2017 at 09:16:35AM +, Ard Biesheuvel wrote:
> Since the skcipher conversion in commit 0605c41cc53c ("crypto:
> cts - Convert to skcipher"), the cts code tacitly assumes that
> the underlying CBC encryption transform performed on the first
> part of the plaintext returns an IV in req->iv that is suitable
> for encrypting the final bit.
>
> While this is usually the case, it is not mandated by the API, and
> given that the CTS code already accesses the ciphertext scatterlist
> to retrieve those bytes, we can simply copy them into req->iv before
> proceeding.

Ugh. While there are some legacy drivers that break it, this is certainly part
of the API. Which underlying CBC implementation is breaking this?

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: generic/cts - fix regression in iv handling
On 17 January 2017 at 09:25, Herbert Xu wrote:
> On Tue, Jan 17, 2017 at 09:20:11AM +, Ard Biesheuvel wrote:
>>
>> So to be clear, it is part of the API that after calling
>> crypto_skcipher_encrypt(req), and completing the request, req->iv
>> should contain a value that could potentially be used to encrypt
>> additional data? That sounds highly specific to CBC (e.g., this could
>> never work with XTS, since the tweak generation is only performed
>> once), so it does not make sense for skciphers in general. For
>> instance, drivers for h/w peripherals that never need to map the data
>> to begin with (since they only pass the physical addresses to the
>> hardware) will need to explicitly map the destination buffer to
>> retrieve those bytes, on the off chance that the transform may be
>> wrapped by CTS.
>
> Yes, this is part of the API. There was a patch to test this in
> testmgr, but I wanted to give the drivers some more time before
> adding it.
>

Got a link?

> It isn't just CBC that uses chaining. Other modes such as CTR
> use it too. Disk encryption in general doesn't use chaining, but
> that's because it is sector-oriented.
>

OK, so that means chaining skcipher_request_set_crypt() calls, where req->iv
is passed on between requests? Are there chaining modes beyond cts(cbc)
encryption that rely on this?

It is easily fixed in the chaining mode code, so I'm perfectly happy to fix it
there instead, but I'd like to understand the requirements exactly before
doing so.
Re: [PATCH] crypto: generic/cts - fix regression in iv handling
On 17 January 2017 at 09:11, Herbert Xu wrote:
> On Mon, Jan 16, 2017 at 09:16:35AM +, Ard Biesheuvel wrote:
>> Since the skcipher conversion in commit 0605c41cc53c ("crypto:
>> cts - Convert to skcipher"), the cts code tacitly assumes that
>> the underlying CBC encryption transform performed on the first
>> part of the plaintext returns an IV in req->iv that is suitable
>> for encrypting the final bit.
>>
>> While this is usually the case, it is not mandated by the API, and
>> given that the CTS code already accesses the ciphertext scatterlist
>> to retrieve those bytes, we can simply copy them into req->iv before
>> proceeding.
>
> Ugh. While there are some legacy drivers that break it, this is
> certainly part of the API.
>
> Which underlying CBC implementation is breaking this?
>

arch/arm64/crypto/aes-modes.S does not return the IV back to the caller, so
cts(cbc-aes-ce) is currently broken.

So to be clear, it is part of the API that after calling
crypto_skcipher_encrypt(req), and completing the request, req->iv should
contain a value that could potentially be used to encrypt additional data?
That sounds highly specific to CBC (e.g., this could never work with XTS,
since the tweak generation is only performed once), so it does not make sense
for skciphers in general. For instance, drivers for h/w peripherals that never
need to map the data to begin with (since they only pass the physical
addresses to the hardware) will need to explicitly map the destination buffer
to retrieve those bytes, on the off chance that the transform may be wrapped
by CTS.
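[The property Ard's fix relies on can be shown with a toy CBC: the chaining
value that a conforming driver would leave in req->iv after encryption is by
construction the last ciphertext block, so copying those bytes out of the
ciphertext scatterlist is an equivalent way to obtain it when the driver does
not write it back. The block cipher here is a toy stand-in for illustration,
not the kernel implementation:]

```python
BLOCK = 8

def toy_block(key, block):
    # Toy 8-byte "block cipher" (XOR with key, then byte rotation).
    # For illustration only -- NOT cryptographically secure.
    x = bytes(a ^ b for a, b in zip(block, key))
    return x[1:] + x[:1]

def cbc_encrypt(key, iv, plaintext):
    """Returns (ciphertext, output_iv). output_iv models what a conforming
    driver leaves in req->iv after crypto_skcipher_encrypt() completes; for
    CBC it always equals the final ciphertext block."""
    assert len(plaintext) % BLOCK == 0
    out, prev = bytearray(), iv
    for off in range(0, len(plaintext), BLOCK):
        blk = bytes(a ^ b for a, b in zip(plaintext[off:off + BLOCK], prev))
        prev = toy_block(key, blk)
        out += prev
    return bytes(out), prev
```

[A CTS wrapper can therefore either trust the returned output_iv or take
ciphertext[-BLOCK:] -- for encryption the two are the same bytes, which is why
the proposed fix works for drivers like cbc-aes-ce that skip the write-back.]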