Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On 2 December 2017 at 13:59, Peter Zijlstra wrote:
> On Sat, Dec 02, 2017 at 11:15:14AM +, Ard Biesheuvel wrote:
>> On 2 December 2017 at 09:11, Ard Biesheuvel wrote:
>
>> > They consume the entire input in a single go, yes. But making it more
>> > granular than that is going to hurt performance, unless we introduce
>> > some kind of kernel_neon_yield(), which does a end+begin but only if
>> > the task is being scheduled out.
>> >
>> > For example, the SHA256 keeps 256 bytes of round constants in NEON
>> > registers, and reloading those from memory for each 64 byte block of
>> > input is going to be noticeable. The same applies to the AES code
>> > (although the numbers are slightly different)
>>
>> Something like below should do the trick I think (apologies for the
>> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
>> few NEON registers are live, and preserve/restore the live registers
>> across calls to kernel_neon_end + kernel_neon_begin. Would that work
>> for RT?
>
> Probably yes. The important point is that preempt latencies (and thus by
> extension NEON regions) are bounded and preferably small.
>
> Unbounded stuff (like depends on the amount of data fed) are a complete
> no-no for RT since then you cannot make predictions on how long things
> will take.

OK, that makes sense. But I do wonder what the parameters should be
here. For instance, the AES instructions on ARMv8 operate at <1 cycle
per byte, and so checking the TIF_NEED_RESCHED flag for every iteration
of the inner loop (i.e., every 64 bytes ~ 64 cycles) is clearly going to
be noticeable, and is probably overkill. The pure NEON version (which is
instantiated from the same block mode wrappers) uses ~25 cycles per
byte, and the bit sliced NEON version runs at ~20 cycles per byte but
can only operate at 8 blocks (128 bytes) at a time.
So rather than simply polling the bit at each iteration of the inner
loop in each algorithm, I'd prefer to aim for a ballpark number of
cycles to execute between checks, on the order of 1000-2000. Would that
be OK, or is that too coarse?
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Sat, Dec 02, 2017 at 11:15:14AM +, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:11, Ard Biesheuvel wrote:
> > They consume the entire input in a single go, yes. But making it more
> > granular than that is going to hurt performance, unless we introduce
> > some kind of kernel_neon_yield(), which does a end+begin but only if
> > the task is being scheduled out.
> >
> > For example, the SHA256 keeps 256 bytes of round constants in NEON
> > registers, and reloading those from memory for each 64 byte block of
> > input is going to be noticeable. The same applies to the AES code
> > (although the numbers are slightly different)
>
> Something like below should do the trick I think (apologies for the
> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
> few NEON registers are live, and preserve/restore the live registers
> across calls to kernel_neon_end + kernel_neon_begin. Would that work
> for RT?

Probably yes. The important point is that preempt latencies (and thus by
extension NEON regions) are bounded and preferably small.

Unbounded stuff (like depends on the amount of data fed) are a complete
no-no for RT since then you cannot make predictions on how long things
will take.
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Sat, Dec 02, 2017 at 09:11:46AM +, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra wrote:
> > On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
> >> Note that the remaining crypto drivers simply operate on fixed buffers, so
> >> while the RT crowd may still feel the need to disable those (and the ones
> >> below as well, perhaps), they don't call back into the crypto layer like
> >> the ones updated by this series, and so there's no room for improvement
> >> there AFAICT.
> >
> > Do these other drivers process all the blocks fed to them in one go
> > under a single NEON section, or do they do a single fixed block per
> > NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.

A little something like this:

  https://lkml.kernel.org/r/20171201113235.6tmkwtov5cg2l...@hirez.programming.kicks-ass.net

> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)

Quite. We could augment the above function with a return value that
says if we actually did a end/begin and registers were clobbered.
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On 2 December 2017 at 09:11, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra wrote:
>> On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
>>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>>> while the RT crowd may still feel the need to disable those (and the ones
>>> below as well, perhaps), they don't call back into the crypto layer like
>>> the ones updated by this series, and so there's no room for improvement
>>> there AFAICT.
>>
>> Do these other drivers process all the blocks fed to them in one go
>> under a single NEON section, or do they do a single fixed block per
>> NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.
>
> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)

Something like below should do the trick I think (apologies for the
patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
few NEON registers are live, and preserve/restore the live registers
across calls to kernel_neon_end + kernel_neon_begin. Would that work
for RT?

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..4f12038574f3 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,6 +77,10 @@
  *			  int blocks)
  */
 ENTRY(sha2_ce_transform)
+	stp	x29, x30, [sp, #-48]!
+	mov	x29, sp
+
+restart:
 	/* load round constants */
 	adr	x8, .Lsha2_rcon
 	ld1	{ v0.4s- v3.4s}, [x8], #64
@@ -129,14 +133,17 @@ CPU_LE(	rev32	v19.16b, v19.16b	)
 	add	dgbv.4s, dgbv.4s, dg1v.4s

 	/* handled all input blocks? */
-	cbnz	w2, 0b
+	cbz	w2, 2f
+
+	tif_need_resched 4f, 5
+	b	0b

 	/*
 	 * Final block: add padding and total bit count.
 	 * Skip if the input size was not a round multiple of the block size,
 	 * the padding is handled by the C code in that case.
 	 */
-	cbz	x4, 3f
+2:	cbz	x4, 3f
 	ldr_l	w4, sha256_ce_offsetof_count, x4
 	ldr	x4, [x0, x4]
 	movi	v17.2d, #0
@@ -151,5 +158,15 @@ CPU_LE(	rev32	v19.16b, v19.16b	)

 	/* store new state */
 3:	st1	{dgav.4s, dgbv.4s}, [x0]
+	ldp	x29, x30, [sp], #48
 	ret
+
+4:	st1	{dgav.4s, dgbv.4s}, [x0]
+	stp	x0, x1, [sp, #16]
+	stp	x2, x4, [sp, #32]
+	bl	kernel_neon_end
+	bl	kernel_neon_begin
+	ldp	x0, x1, [sp, #16]
+	ldp	x2, x4, [sp, #32]
+	b	restart
 ENDPROC(sha2_ce_transform)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index aef72d886677..e3e7e15ebefd 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -512,4 +512,15 @@ alternative_else_nop_endif
 #endif
 	.endm

+/*
+ * Check TIF_NEED_RESCHED flag from assembler (for kernel mode NEON)
+ */
+	.macro	tif_need_resched, lbl:req, regnum:req
+#ifdef CONFIG_PREEMPT
+	get_thread_info	x\regnum
+	ldr	w\regnum, [x\regnum, #TSK_TI_FLAGS]	// get flags
+	tbnz	w\regnum, #TIF_NEED_RESCHED, \lbl	// needs rescheduling?
+#endif
+	.endm
+
 #endif /* __ASM_ASSEMBLER_H */
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On 2 December 2017 at 09:01, Peter Zijlstra wrote:
> On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>> while the RT crowd may still feel the need to disable those (and the ones
>> below as well, perhaps), they don't call back into the crypto layer like
>> the ones updated by this series, and so there's no room for improvement
>> there AFAICT.
>
> Do these other drivers process all the blocks fed to them in one go
> under a single NEON section, or do they do a single fixed block per
> NEON invocation?

They consume the entire input in a single go, yes. But making it more
granular than that is going to hurt performance, unless we introduce
some kind of kernel_neon_yield(), which does a end+begin but only if
the task is being scheduled out.

For example, the SHA256 keeps 256 bytes of round constants in NEON
registers, and reloading those from memory for each 64 byte block of
input is going to be noticeable. The same applies to the AES code
(although the numbers are slightly different)
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
> Note that the remaining crypto drivers simply operate on fixed buffers, so
> while the RT crowd may still feel the need to disable those (and the ones
> below as well, perhaps), they don't call back into the crypto layer like
> the ones updated by this series, and so there's no room for improvement
> there AFAICT.

Do these other drivers process all the blocks fed to them in one go
under a single NEON section, or do they do a single fixed block per
NEON invocation?
[PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.

This was deliberate: when this code was introduced, each
kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty
of storing resp. loading the contents of all NEON registers to/from
memory, and so doing it less often had an obvious performance benefit.
However, in the meantime, we have refactored the core kernel mode NEON
code, and now kernel_neon_begin() only incurs this penalty the first
time it is called after entering the kernel, and the NEON register
restore is deferred until returning to userland. This means pulling
those calls into the loops that iterate over the input/output of the
crypto algorithm is not a big deal anymore (although there are some
places in the code where we relied on the NEON registers retaining
their values between calls).

So let's clean this up for arm64: update the NEON based skcipher
drivers to no longer keep the NEON enabled when calling into the
skcipher walk API.

Note that the remaining crypto drivers simply operate on fixed buffers,
so while the RT crowd may still feel the need to disable those (and the
ones below as well, perhaps), they don't call back into the crypto
layer like the ones updated by this series, and so there's no room for
improvement there AFAICT.
Cc: Dave Martin
Cc: Russell King - ARM Linux
Cc: Sebastian Andrzej Siewior
Cc: Mark Rutland
Cc: linux-rt-us...@vger.kernel.org
Cc: Peter Zijlstra
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Steven Rostedt
Cc: Thomas Gleixner

Ard Biesheuvel (5):
  crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  crypto: arm64/ghash - move kernel mode neon en/disable into loop

 arch/arm64/crypto/aes-ce-ccm-glue.c    | 47 +-
 arch/arm64/crypto/aes-glue.c           | 81 +-
 arch/arm64/crypto/aes-modes.S          | 90 ++--
 arch/arm64/crypto/aes-neonbs-glue.c    | 38 -
 arch/arm64/crypto/chacha20-neon-glue.c |  4 +-
 arch/arm64/crypto/ghash-ce-glue.c      |  9 +-
 6 files changed, 132 insertions(+), 137 deletions(-)

-- 
2.11.0