Re: WARNING: kernel stack regs has bad 'bp' value (3)

2018-05-12 Thread Ard Biesheuvel
On 12 May 2018 at 11:50, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Sat, May 12, 2018 at 11:09 AM, Ard Biesheuvel
> <ard.biesheu...@linaro.org> wrote:
>> (+ Arnd)
>>
>> On 12 May 2018 at 10:43, Dmitry Vyukov <dvyu...@google.com> wrote:
>>> On Fri, Feb 2, 2018 at 11:18 PM, Eric Biggers <ebigge...@gmail.com> wrote:
>>>> On Fri, Feb 02, 2018 at 02:57:32PM +0100, Dmitry Vyukov wrote:
>>>>> On Fri, Feb 2, 2018 at 2:48 PM, syzbot
>>>>> <syzbot+ffa3a158337bbc01f...@syzkaller.appspotmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > syzbot hit the following crash on upstream commit
>>>>> > 7109a04eae81c41ed529da9f3c48c3655ccea741 (Thu Feb 1 17:37:30 2018 +)
>>>>> > Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
>>>>> >
>>>>> > So far this crash happened 4 times on net-next, upstream.
>>>>> > C reproducer is attached.
>>>>> > syzkaller reproducer is attached.
>>>>> > Raw console output is attached.
>>>>> > compiler: gcc (GCC) 7.1.1 20170620
>>>>> > .config is attached.
>>>>>
>>>>>
>>>>> From suspicious frames I see salsa20_asm_crypt there, so +crypto 
>>>>> maintainers.
>>>>>
>>>>
>>>> Looks like the x86 implementations of Salsa20 (both i586 and x86_64) need 
>>>> to be
>>>> updated to not use %ebp/%rbp.
>>>
>>> Ard,
>>>
>>> This was bisected as introduced by:
>>>
>>> commit 83dee2ce1ae791c3dc0c9d4d3a8d42cb109613f6
>>> Author: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>> Date:   Fri Jan 19 12:04:34 2018 +
>>>
>>> crypto: sha3-generic - rewrite KECCAK transform to help the
>>> compiler optimize
>>>
>>> https://gist.githubusercontent.com/dvyukov/47f93f5a0679170dddf93bc019b42f6d/raw/65beac8ddd30003bbd4e9729236dc8572094abf7/gistfile1.txt
>>
>> Ouch.
>>
>> I'm not an expert in x86 assembly. Could someone please check the
>> generated code to see what's going on? The C code changes are not that
>> intricate, they basically unroll a loop, replacing accesses to
>> 'array[indirect_index[i]]' with 'array[constant]'.
>>
>> As mentioned in the commit log, the speedup is more than significant
>> for architectures with lots of GPRs so I'd prefer fixing the patch
>> over reverting it (if there is anything wrong with the code in the
>> first place)
>
> I suspect the problem is with __attribute__((__optimize__("O3"))). It
> makes compiler use rbp register, which must not be used.

IIRC, the additional speedup from adding that was significant but not
huge. Given that we don't use O3 anywhere else, I guess we should just
remove it.

Could you please check whether that makes the issue go away?

If so,

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

for any patch that removes the O3 attribute override from keccakf()
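
For reference, the change being acked amounts to dropping the attribute
from the function definition, roughly the following one-liner (the exact
declaration in crypto/sha3_generic.c may differ slightly):

-static void __attribute__((__optimize__("O3"))) keccakf(u64 st[25])
+static void keccakf(u64 st[25])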

Thanks,
Ard.


Re: WARNING: kernel stack regs has bad 'bp' value (3)

2018-05-12 Thread Ard Biesheuvel
(+ Arnd)

On 12 May 2018 at 10:43, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Fri, Feb 2, 2018 at 11:18 PM, Eric Biggers <ebigge...@gmail.com> wrote:
>> On Fri, Feb 02, 2018 at 02:57:32PM +0100, Dmitry Vyukov wrote:
>>> On Fri, Feb 2, 2018 at 2:48 PM, syzbot
>>> <syzbot+ffa3a158337bbc01f...@syzkaller.appspotmail.com> wrote:
>>> > Hello,
>>> >
>>> > syzbot hit the following crash on upstream commit
>>> > 7109a04eae81c41ed529da9f3c48c3655ccea741 (Thu Feb 1 17:37:30 2018 +)
>>> > Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
>>> >
>>> > So far this crash happened 4 times on net-next, upstream.
>>> > C reproducer is attached.
>>> > syzkaller reproducer is attached.
>>> > Raw console output is attached.
>>> > compiler: gcc (GCC) 7.1.1 20170620
>>> > .config is attached.
>>>
>>>
>>> From suspicious frames I see salsa20_asm_crypt there, so +crypto 
>>> maintainers.
>>>
>>
>> Looks like the x86 implementations of Salsa20 (both i586 and x86_64) need to 
>> be
>> updated to not use %ebp/%rbp.
>
> Ard,
>
> This was bisected as introduced by:
>
> commit 83dee2ce1ae791c3dc0c9d4d3a8d42cb109613f6
> Author: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date:   Fri Jan 19 12:04:34 2018 +
>
> crypto: sha3-generic - rewrite KECCAK transform to help the
> compiler optimize
>
> https://gist.githubusercontent.com/dvyukov/47f93f5a0679170dddf93bc019b42f6d/raw/65beac8ddd30003bbd4e9729236dc8572094abf7/gistfile1.txt

Ouch.

I'm not an expert in x86 assembly. Could someone please check the
generated code to see what's going on? The C code changes are not that
intricate, they basically unroll a loop, replacing accesses to
'array[indirect_index[i]]' with 'array[constant]'.

As mentioned in the commit log, the speedup is more than significant
for architectures with lots of GPRs so I'd prefer fixing the patch
over reverting it (if there is anything wrong with the code in the
first place)

-- 
Ard.


[PATCH resend 10/10] crypto: arm64/sha512-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha512-ce-core.S | 27 +++-
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/sha512-ce-core.S 
b/arch/arm64/crypto/sha512-ce-core.S
index 7f3bca5c59a2..ce65e3abe4f2 100644
--- a/arch/arm64/crypto/sha512-ce-core.S
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -107,17 +107,23 @@
 */
.text
 ENTRY(sha512_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load state */
-   ld1 {v8.2d-v11.2d}, [x0]
+0: ld1 {v8.2d-v11.2d}, [x19]
 
/* load first 4 round constants */
adr_l   x3, .Lsha512_rcon
ld1 {v20.2d-v23.2d}, [x3], #64
 
/* load input */
-0: ld1 {v12.2d-v15.2d}, [x1], #64
-   ld1 {v16.2d-v19.2d}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v12.2d-v15.2d}, [x20], #64
+   ld1 {v16.2d-v19.2d}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev64   v12.16b, v12.16b)
 CPU_LE(rev64   v13.16b, v13.16b)
@@ -196,9 +202,18 @@ CPU_LE(rev64   v19.16b, v19.16b)
add v11.2d, v11.2d, v3.2d
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {v8.2d-v11.2d}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* store new state */
-3: st1 {v8.2d-v11.2d}, [x0]
+3: st1 {v8.2d-v11.2d}, [x19]
+   frame_pop
ret
 ENDPROC(sha512_ce_transform)
-- 
2.17.0



[PATCH resend 07/10] crypto: arm64/crc32-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crc32-ce-core.S | 40 +++-
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S 
b/arch/arm64/crypto/crc32-ce-core.S
index 16ed3c7ebd37..8061bf0f9c66 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,10 @@
dCONSTANT   .reqd0
qCONSTANT   .reqq0
 
-   BUF .reqx0
-   LEN .reqx1
-   CRC .reqx2
+   BUF .reqx19
+   LEN .reqx20
+   CRC .reqx21
+   CONST   .reqx22
 
vzr .reqv9
 
@@ -123,7 +124,14 @@ ENTRY(crc32_pmull_le)
 ENTRY(crc32c_pmull_le)
adr_l   x3, .Lcrc32c_constants
 
-0: bic LEN, LEN, #15
+0: frame_push  4, 64
+
+   mov BUF, x0
+   mov LEN, x1
+   mov CRC, x2
+   mov CONST, x3
+
+   bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movivzr.16b, #0
fmovdCONSTANT, CRC
@@ -132,7 +140,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.ltless_64
 
-   ldr qCONSTANT, [x3]
+   ldr qCONSTANT, [CONST]
 
 loop_64:   /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -162,10 +170,21 @@ loop_64:  /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
 
cmp LEN, #0x40
-   b.geloop_64
+   b.ltless_64
+
+   if_will_cond_yield_neon
+   stp q1, q2, [sp, #.Lframe_local_offset]
+   stp q3, q4, [sp, #.Lframe_local_offset + 32]
+   do_cond_yield_neon
+   ldp q1, q2, [sp, #.Lframe_local_offset]
+   ldp q3, q4, [sp, #.Lframe_local_offset + 32]
+   ldr qCONSTANT, [CONST]
+   movivzr.16b, #0
+   endif_yield_neon
+   b   loop_64
 
 less_64:   /* Folding cache line into 128bit */
-   ldr qCONSTANT, [x3, #16]
+   ldr qCONSTANT, [CONST, #16]
 
pmull2  v5.1q, v1.2d, vCONSTANT.2d
pmull   v1.1q, v1.1d, vCONSTANT.1d
@@ -204,8 +223,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* final 32-bit fold */
-   ldr dCONSTANT, [x3, #32]
-   ldr d3, [x3, #40]
+   ldr dCONSTANT, [CONST, #32]
+   ldr d3, [CONST, #40]
 
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -213,7 +232,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
-   ldr qCONSTANT, [x3, #48]
+   ldr qCONSTANT, [CONST, #48]
 
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -223,6 +242,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
 
+   frame_pop
ret
 ENDPROC(crc32_pmull_le)
 ENDPROC(crc32c_pmull_le)
-- 
2.17.0



[PATCH resend 09/10] crypto: arm64/sha3-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha3-ce-core.S | 77 +---
 1 file changed, 50 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
index 332ad7530690..a7d587fa54f6 100644
--- a/arch/arm64/crypto/sha3-ce-core.S
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -41,9 +41,16 @@
 */
.text
 ENTRY(sha3_ce_transform)
-   /* load state */
-   add x8, x0, #32
-   ld1 { v0.1d- v3.1d}, [x0]
+   frame_push  4
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+
+0: /* load state */
+   add x8, x19, #32
+   ld1 { v0.1d- v3.1d}, [x19]
ld1 { v4.1d- v7.1d}, [x8], #32
ld1 { v8.1d-v11.1d}, [x8], #32
ld1 {v12.1d-v15.1d}, [x8], #32
@@ -51,13 +58,13 @@ ENTRY(sha3_ce_transform)
ld1 {v20.1d-v23.1d}, [x8], #32
ld1 {v24.1d}, [x8]
 
-0: sub w2, w2, #1
+1: sub w21, w21, #1
mov w8, #24
adr_l   x9, .Lsha3_rcon
 
/* load input */
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v31.8b}, [x1], #24
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v31.8b}, [x20], #24
eor v0.8b, v0.8b, v25.8b
eor v1.8b, v1.8b, v26.8b
eor v2.8b, v2.8b, v27.8b
@@ -66,10 +73,10 @@ ENTRY(sha3_ce_transform)
eor v5.8b, v5.8b, v30.8b
eor v6.8b, v6.8b, v31.8b
 
-   tbnzx3, #6, 2f  // SHA3-512
+   tbnzx22, #6, 3f // SHA3-512
 
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v30.8b}, [x1], #16
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v30.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
eor  v9.8b,  v9.8b, v27.8b
@@ -77,34 +84,34 @@ ENTRY(sha3_ce_transform)
eor v11.8b, v11.8b, v29.8b
eor v12.8b, v12.8b, v30.8b
 
-   tbnzx3, #4, 1f  // SHA3-384 or SHA3-224
+   tbnzx22, #4, 2f // SHA3-384 or SHA3-224
 
// SHA3-256
-   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v25.8b-v28.8b}, [x20], #32
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
-   b   3f
+   b   4f
 
-1: tbz x3, #2, 3f  // bit 2 cleared? SHA-384
+2: tbz x22, #2, 4f // bit 2 cleared? SHA-384
 
// SHA3-224
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b}, [x1], #8
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b}, [x20], #8
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
eor v17.8b, v17.8b, v29.8b
-   b   3f
+   b   4f
 
// SHA3-512
-2: ld1 {v25.8b-v26.8b}, [x1], #16
+3: ld1 {v25.8b-v26.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
 
-3: sub w8, w8, #1
+4: sub w8, w8, #1
 
eor3v29.16b,  v4.16b,  v9.16b, v14.16b
eor3v26.16b,  v1.16b,  v6.16b, v11.16b
@@ -183,17 +190,33 @@ ENTRY(sha3_ce_transform)
 
eor  v0.16b,  v0.16b, v31.16b
 
-   cbnzw8, 3b
-   cbnzw2, 0b
+   cbnzw8, 4b
+   cbz w21, 5f
+
+   if_will_cond_yield_neon
+   add x8, x19, #32
+   st1 { v0.1d- v3.1d}, [x19]
+   st1 { v4.1d- v7.1d}, [x8], #32
+   st1 { v8.1d-v11.1d}, [x8], #32
+   st1 {v12.1d-v15.1d}, [x8], #32
+   st1 {v16.1d-v19.1d}, [x8], #32
+   st1 {v20.1d-v23.1d}, [x8], #32
+   st1 {v24.1d}, [x8]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* save state */
-   st1 { v0.1d- v3.1d}, [x0], #32
-   st1 { v4.1d- v7.1d}, [x0], #32
-   st1 { v8.1d-v11.1d}, [x0], #32
-   st1 {v12.1d-v15.1d}, [x0], #32
-   st1 {v16.1d-v19.1d}, [x0], #32
-   st1 {v20.1d-v23.1d}, [x0], #32
-   st1 {v24.1d}, [x0]
+5: st1 { v0.1d- v3.1d}, [x19], #32
+   st1 { v4.1d- v7.1d}, [x19], #32
+   st1 { v8.1d-v11.1d}, [x19], #32
+   st1 {v12.1d-v15.1d}, [x19], #32
+   st1 {v16.1d-v19.1d}, [x19], #32
+   st1 {v20.1d-v23.1d}, [x19], #32
+   st1 {v24.1d}, [x19]
+   frame_pop
ret
 ENDPROC(sha3_ce_transform)
 
-- 
2.17.0



[PATCH resend 06/10] crypto: arm64/aes-ghash - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/ghash-ce-core.S | 113 ++--
 arch/arm64/crypto/ghash-ce-glue.c |  28 +++--
 2 files changed, 97 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S 
b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..dcffb9e77589 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
 
.macro  __pmull_ghash, pn
-   ld1 {SHASH.2d}, [x3]
-   ld1 {XL.2d}, [x1]
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+   ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
/* do the head block first, if supplied */
-   cbz x4, 0f
-   ld1 {T1.2d}, [x4]
-   b   1f
+   cbz x23, 1f
+   ld1 {T1.2d}, [x23]
+   mov x23, xzr
+   b   2f
 
-0: ld1 {T1.2d}, [x2], #16
-   sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+   sub w19, w19, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
 CPU_LE(rev64   T1.16b, T1.16b  )
 
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE(rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
-   cbnzw0, 0b
+   cbz w19, 3f
+
+   if_will_cond_yield_neon
+   st1 {XL.2d}, [x20]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
-   st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+   frame_pop
ret
.endm
 
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
 
.macro  pmull_gcm_do_crypt, enc
-   ld1 {SHASH.2d}, [x4]
-   ld1 {XL.2d}, [x1]
-   ldr x8, [x5, #8]// load lower counter
+   frame_push  10
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+   mov x26, x7
+   .if \enc == 1
+   ldr x27, [sp, #96]  // first stacked arg
+   .endif
+
+   ldr x28, [x24, #8]  // load lower counter
+CPU_LE(rev x28, x28)
+
+0: mov x0, x25
+   load_round_keys w26, x0
+   ld1 {SHASH.2d}, [x23]
+   ld1 {XL.2d}, [x20]
 
moviMASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE(rev x8, x8  )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
-   ld1 {KS.16b}, [x7]
+   ld1 {KS.16b}, [x27]
.endif
 
-0: ld1 {CTR.8b}, [x5]  // load upper counter
-   ld1 {INP.16b}, [x3], #16
-   rev x9, x8
-   add x8, x8, #1
-   sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+   ld1 {INP.16b}, [x22], #16
+   rev x9, x28
+   add x28, x28, #1
+   sub w19, w19, #1
ins CTR.d[1], x9// set lower counter
 
.if \enc == 1
eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x2], #16
+   st1 {INP.16b}, [x21], #16
.endif
 
rev64   T1.16b, INP.16b
 
-   cmp w6, #12
-   b.ge2f  // AES-192/256?
+   cmp w26, #12
+   b.ge4f  // AES-192/256?
 
-1: enc_round   CTR, v21
+2: enc_round   CTR, v21
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE(   rev x8, x8  )
 
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
-   st1 {I

[PATCH resend 08/10] crypto: arm64/crct10dif-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crct10dif-ce-core.S | 32 +---
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S 
b/arch/arm64/crypto/crct10dif-ce-core.S
index f179c01bd55c..663ea71cdb38 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpugeneric+crypto
 
-   arg1_low32  .reqw0
-   arg2.reqx1
-   arg3.reqx2
+   arg1_low32  .reqw19
+   arg2.reqx20
+   arg3.reqx21
 
vzr .reqv13
 
 ENTRY(crc_t10dif_pmull)
+   frame_push  3, 128
+
+   mov arg1_low32, w0
+   mov arg2, x1
+   mov arg3, x2
+
movivzr.16b, #0 // init zero register
 
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE(ext v12.16b, v12.16b, v12.16b, #8   
)
subsarg3, arg3, #128
 
// check if there is another 64B in the buffer to be able to fold
-   b.ge_fold_64_B_loop
+   b.lt_fold_64_B_end
+
+   if_will_cond_yield_neon
+   stp q0, q1, [sp, #.Lframe_local_offset]
+   stp q2, q3, [sp, #.Lframe_local_offset + 32]
+   stp q4, q5, [sp, #.Lframe_local_offset + 64]
+   stp q6, q7, [sp, #.Lframe_local_offset + 96]
+   do_cond_yield_neon
+   ldp q0, q1, [sp, #.Lframe_local_offset]
+   ldp q2, q3, [sp, #.Lframe_local_offset + 32]
+   ldp q4, q5, [sp, #.Lframe_local_offset + 64]
+   ldp q6, q7, [sp, #.Lframe_local_offset + 96]
+   ldr_l   q10, rk3, x8
+   movivzr.16b, #0 // init zero register
+   endif_yield_neon
+
+   b   _fold_64_B_loop
 
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
 _cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+   frame_pop
ret
 
 _less_than_128:
-- 
2.17.0



[PATCH resend 05/10] crypto: arm64/aes-bs - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-core.S | 305 +++-
 1 file changed, 170 insertions(+), 135 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..e613a87f8b53 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
 *   int blocks)
 */
.macro  __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
 99:mov x5, #1
-   lsl x5, x5, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x5, x5, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx5, x5, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
tbnzx5, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
tbnzx5, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
tbnzx5, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
tbnzx5, #4, 0f
-   ld1 {v4.16b}, [x1], #16
+   ld1 {v4.16b}, [x20], #16
tbnzx5, #5, 0f
-   ld1 {v5.16b}, [x1], #16
+   ld1 {v5.16b}, [x20], #16
tbnzx5, #6, 0f
-   ld1 {v6.16b}, [x1], #16
+   ld1 {v6.16b}, [x20], #16
tbnzx5, #7, 0f
-   ld1 {v7.16b}, [x1], #16
+   ld1 {v7.16b}, [x20], #16
 
-0: mov bskey, x2
-   mov rounds, x3
+0: mov bskey, x21
+   mov rounds, x22
bl  \do8
 
-   st1 {\o0\().16b}, [x0], #16
+   st1 {\o0\().16b}, [x19], #16
tbnzx5, #1, 1f
-   st1 {\o1\().16b}, [x0], #16
+   st1 {\o1\().16b}, [x19], #16
tbnzx5, #2, 1f
-   st1 {\o2\().16b}, [x0], #16
+   st1 {\o2\().16b}, [x19], #16
tbnzx5, #3, 1f
-   st1 {\o3\().16b}, [x0], #16
+   st1 {\o3\().16b}, [x19], #16
tbnzx5, #4, 1f
-   st1 {\o4\().16b}, [x0], #16
+   st1 {\o4\().16b}, [x19], #16
tbnzx5, #5, 1f
-   st1 {\o5\().16b}, [x0], #16
+   st1 {\o5\().16b}, [x19], #16
tbnzx5, #6, 1f
-   st1 {\o6\().16b}, [x0], #16
+   st1 {\o6\().16b}, [x19], #16
tbnzx5, #7, 1f
-   st1 {\o7\().16b}, [x0], #16
+   st1 {\o7\().16b}, [x19], #16
 
-   cbnzx4, 99b
+   cbz x23, 1f
+   cond_yield_neon
+   b   99b
 
-1: ldp x29, x30, [sp], #16
+1: frame_pop
ret
.endm
 
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
 */
.align  4
 ENTRY(aesbs_cbc_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  6
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
 
 99:mov x6, #1
-   lsl x6, x6, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x6, x6, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx6, x6, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnzx6, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnzx6, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnzx6, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnzx6, #4, 0f
-

[PATCH resend 04/10] crypto: arm64/aes-blk - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-ce.S|  15 +-
 arch/arm64/crypto/aes-modes.S | 331 
 2 files changed, 216 insertions(+), 130 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
 
/* prepare for encryption with key in rk[] */
-   .macro  enc_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for encryption (again) but with new key in rk[] */
-   .macro  enc_switch_key, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_switch_key, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for decryption with key in rk[] */
-   .macro  dec_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  dec_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
.macro  do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..483a7130cf0e 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align  4
 
 aes_encrypt_block4x:
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
 
-   enc_prepare w3, x2, x5
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+.Lecbencrestart:
+   enc_prepare w22, x21, x5
 
 .LecbencloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbenc1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl  aes_encrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbencrestart
b   .LecbencloopNx
 .Lecbenc1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbencout
 .Lecbencloop:
-   ld1 {v0.16b}, [x1], #16 /* get next pt block */
-   encrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   ld1 {v0.16b}, [x20], #16/* get next pt block */
+   encrypt_block   v0, w22, x21, x5, w6
+   st1 {v0.16b}, [x19], #16
+   subsw23, w23, #1
bne .Lecbencloop
 .Lecbencout:
-   ldp x29, x30, [sp], #16
+   frame_pop
ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
-   dec_prepare w3, x2, x5
+.Lecbdecrestart:
+   dec_prepare w22, x21, x5
 
 .LecbdecloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbdec1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl  aes_decrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbdecrestart
b   .LecbdecloopNx
 .Lecbdec1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbdecout
 .Lecbdecloop:
-   ld1 {v0.16b}, [x1], #16 /* get next ct block */
-   decrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+

[PATCH resend 03/10] crypto: arm64/aes-ccm - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +---
 1 file changed, 95 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S 
b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..88f5aef7934c 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   ldr w8, [x3]/* leftover from prev round? */
+   frame_push  7
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+
+   ldr w25, [x22]  /* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w8, 1f
-   sub w8, w8, #16
+   cbz w25, 1f
+   sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrbw7, [x1], #1/* get 1 byte of input */
-   subsw2, w2, #1
-   add w8, w8, #1
+0: ldrbw7, [x20], #1   /* get 1 byte of input */
+   subsw21, w21, #1
+   add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw8, 0b
+   cbnzw25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4]   /* load first round key */
-   prfmpldl1strm, [x1]
-   cmp w5, #12 /* which key size? */
-   add x6, x4, #16
-   sub w7, w5, #2  /* modified # of rounds */
+1: ld1 {v3.4s}, [x23]  /* load first round key */
+   prfmpldl1strm, [x20]
+   cmp w24, #12/* which key size? */
+   add x6, x23, #16
+   sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aesev0.16b, v4.16b
-   subsw2, w2, #16 /* last data? */
+   subsw21, w21, #16   /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x1], #16 /* load next input block */
+   ld1 {v1.16b}, [x20], #16/* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   bne 1b
-6: st1 {v0.16b}, [x0]  /* store mac */
+   beq 6f
+
+   if_will_cond_yield_neon
+   st1 {v0.16b}, [x19] /* store mac */
+   do_cond_yield_neon
+   ld1 {v0.16b}, [x19] /* reload mac */
+   endif_yield_neon
+
+   b   1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
-   addsw2, w2, #16
+   addsw21, w21, #16
beq 10f
-   mov w8, w2
-7: ldrbw7, [x1], #1
+   mov w25, w21
+7: ldrbw7, [x20], #1
umovw6, v0.b[0]
eor w6, w6, w7
-   strbw6, [x0], #1
-   subsw2, w2, #1
+   strbw6, [x19], #1
+   subsw21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1  /* rotate out the mac bytes */
b   7b
-8: mov w7, w8
-   add w8, w8, #16
+8: mov w7, w25
+   add w25, w25, #16
 9: ext v1.16b, v1.16b, v1.16b, #1
addsw7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
-   st1 {v0.16b}, [x0]
-10:str w8, [x3]
+   st1 {v0.16b}, [x19]
+10:str w25, [x22]
+
+   frame_pop
ret
 ENDPROC(ce_aes_ccm_auth_data)
 
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
 ENDPROC(ce_aes_ccm_final)
 
.macro  aes_ccm_do_crypt,enc
-   ldr x8, [x6, #8]/* load lower ctr */
-   ld1 {v0.16b}, [x5]  /* load mac */
-CPU_LE(rev x8, x8  )   /* keep swabbed ctr in 
reg */
+   frame_push  8
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+
+   ldr x26, [x25, #8]  /* load lower ctr */
+   ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE(rev x26, x26)   /* keep swabbed ctr in 
reg

[PATCH resend 02/10] crypto: arm64/sha2-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha2-ce-core.S | 37 ++--
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 4c3c89b812ce..cd8b36412469 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -79,30 +79,36 @@
 */
.text
 ENTRY(sha2_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   adr_l   x8, .Lsha2_rcon
+0: adr_l   x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
 
/* load state */
-   ld1 {dgav.4s, dgbv.4s}, [x0]
+   ld1 {dgav.4s, dgbv.4s}, [x19]
 
/* load sha256_ce_state::finalize */
ldr_l   w4, sha256_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v16.16b, v16.16b)
 CPU_LE(rev32   v17.16b, v17.16b)
 CPU_LE(rev32   v18.16b, v18.16b)
 CPU_LE(rev32   v19.16b, v19.16b)
 
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
 
@@ -131,16 +137,24 @@ CPU_LE(   rev32   v19.16b, v19.16b)
add dgbv.4s, dgbv.4s, dg1v.4s
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s, dgbv.4s}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha256_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv17.2d, #0
mov x8, #0x8000
moviv18.2d, #0
@@ -149,9 +163,10 @@ CPU_LE(rev32   v19.16b, v19.16b)
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+   frame_pop
ret
 ENDPROC(sha2_ce_transform)
-- 
2.17.0



[PATCH resend 00/10] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-04-30 Thread Ard Biesheuvel
Hello Herbert,

These are the patches that depend on the arm64/assembler.h patches that
inadvertently got pulled into the cryptodev tree and reverted shortly
after. Those have now been merged into Linus's tree, and so the
remaining changes can be applied as well. Please apply.

Ard Biesheuvel (10):
  crypto: arm64/sha1-ce - yield NEON after every block of input
  crypto: arm64/sha2-ce - yield NEON after every block of input
  crypto: arm64/aes-ccm - yield NEON after every block of input
  crypto: arm64/aes-blk - yield NEON after every block of input
  crypto: arm64/aes-bs - yield NEON after every block of input
  crypto: arm64/aes-ghash - yield NEON after every block of input
  crypto: arm64/crc32-ce - yield NEON after every block of input
  crypto: arm64/crct10dif-ce - yield NEON after every block of input
  crypto: arm64/sha3-ce - yield NEON after every block of input
  crypto: arm64/sha512-ce - yield NEON after every block of input

 arch/arm64/crypto/aes-ce-ccm-core.S   | 150 +
 arch/arm64/crypto/aes-ce.S|  15 +-
 arch/arm64/crypto/aes-modes.S | 331 
 arch/arm64/crypto/aes-neonbs-core.S   | 305 ++
 arch/arm64/crypto/crc32-ce-core.S |  40 ++-
 arch/arm64/crypto/crct10dif-ce-core.S |  32 +-
 arch/arm64/crypto/ghash-ce-core.S | 113 +--
 arch/arm64/crypto/ghash-ce-glue.c |  28 +-
 arch/arm64/crypto/sha1-ce-core.S  |  42 ++-
 arch/arm64/crypto/sha2-ce-core.S  |  37 ++-
 arch/arm64/crypto/sha3-ce-core.S  |  77 +++--
 arch/arm64/crypto/sha512-ce-core.S|  27 +-
 12 files changed, 762 insertions(+), 435 deletions(-)

-- 
2.17.0



[PATCH resend 01/10] crypto: arm64/sha1-ce - yield NEON after every block of input

2018-04-30 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha1-ce-core.S | 42 ++--
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 46049850727d..78eb35fb5056 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -69,30 +69,36 @@
 *int blocks)
 */
 ENTRY(sha1_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   loadrc  k0.4s, 0x5a827999, w6
+0: loadrc  k0.4s, 0x5a827999, w6
loadrc  k1.4s, 0x6ed9eba1, w6
loadrc  k2.4s, 0x8f1bbcdc, w6
loadrc  k3.4s, 0xca62c1d6, w6
 
/* load state */
-   ld1 {dgav.4s}, [x0]
-   ldr dgb, [x0, #16]
+   ld1 {dgav.4s}, [x19]
+   ldr dgb, [x19, #16]
 
/* load sha1_ce_state::finalize */
ldr_l   w4, sha1_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v8.16b, v8.16b  )
 CPU_LE(rev32   v9.16b, v9.16b  )
 CPU_LE(rev32   v10.16b, v10.16b)
 CPU_LE(rev32   v11.16b, v11.16b)
 
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
 
add_update  c, ev, k0,  8,  9, 10, 11, dgb
@@ -123,16 +129,25 @@ CPU_LE(   rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha1_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv9.2d, #0
mov x8, #0x8000
moviv10.2d, #0
@@ -141,10 +156,11 @@ CPU_LE(   rev32   v11.16b, v11.16b)
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s}, [x0]
-   str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   frame_pop
ret
 ENDPROC(sha1_ce_transform)
-- 
2.17.0



Re: [PATCH 1/2] crypto: sm4 - export encrypt/decrypt routines to other drivers

2018-04-25 Thread Ard Biesheuvel
On 25 April 2018 at 14:20, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> In preparation of adding support for the SIMD based arm64 implementation
> of arm64,

SM4 ^^^

> which requires a fallback to non-SIMD code when invoked in
> certain contexts, expose the generic SM4 encrypt and decrypt routines
> to other drivers.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> ---
>  crypto/sm4_generic.c | 10 ++
>  include/crypto/sm4.h |  3 +++
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/crypto/sm4_generic.c b/crypto/sm4_generic.c
> index f537a2766c55..c18eebfd5edd 100644
> --- a/crypto/sm4_generic.c
> +++ b/crypto/sm4_generic.c
> @@ -190,21 +190,23 @@ static void sm4_do_crypt(const u32 *rk, u32 *out, const 
> u32 *in)
>
>  /* encrypt a block of text */
>
> -static void sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
> +void crypto_sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
> const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
>
> sm4_do_crypt(ctx->rkey_enc, (u32 *)out, (u32 *)in);
>  }
> +EXPORT_SYMBOL_GPL(crypto_sm4_encrypt);
>
>  /* decrypt a block of text */
>
> -static void sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
> +void crypto_sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
> const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
>
> sm4_do_crypt(ctx->rkey_dec, (u32 *)out, (u32 *)in);
>  }
> +EXPORT_SYMBOL_GPL(crypto_sm4_decrypt);
>
>  static struct crypto_alg sm4_alg = {
> .cra_name   =   "sm4",
> @@ -219,8 +221,8 @@ static struct crypto_alg sm4_alg = {
> .cia_min_keysize=   SM4_KEY_SIZE,
> .cia_max_keysize=   SM4_KEY_SIZE,
> .cia_setkey =   crypto_sm4_set_key,
> -   .cia_encrypt=   sm4_encrypt,
> -   .cia_decrypt=   sm4_decrypt
> +   .cia_encrypt=   crypto_sm4_encrypt,
> +   .cia_decrypt=   crypto_sm4_decrypt
> }
> }
>  };
> diff --git a/include/crypto/sm4.h b/include/crypto/sm4.h
> index b64e64d20b28..7afd730d16ff 100644
> --- a/include/crypto/sm4.h
> +++ b/include/crypto/sm4.h
> @@ -25,4 +25,7 @@ int crypto_sm4_set_key(struct crypto_tfm *tfm, const u8 
> *in_key,
>  int crypto_sm4_expand_key(struct crypto_sm4_ctx *ctx, const u8 *in_key,
>   unsigned int key_len);
>
> +void crypto_sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
> +void crypto_sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
> +
>  #endif
> --
> 2.17.0
>


[PATCH 1/2] crypto: sm4 - export encrypt/decrypt routines to other drivers

2018-04-25 Thread Ard Biesheuvel
In preparation of adding support for the SIMD based arm64 implementation
of SM4, which requires a fallback to non-SIMD code when invoked in
certain contexts, expose the generic SM4 encrypt and decrypt routines
to other drivers.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sm4_generic.c | 10 ++
 include/crypto/sm4.h |  3 +++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/crypto/sm4_generic.c b/crypto/sm4_generic.c
index f537a2766c55..c18eebfd5edd 100644
--- a/crypto/sm4_generic.c
+++ b/crypto/sm4_generic.c
@@ -190,21 +190,23 @@ static void sm4_do_crypt(const u32 *rk, u32 *out, const 
u32 *in)
 
 /* encrypt a block of text */
 
-static void sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
 
sm4_do_crypt(ctx->rkey_enc, (u32 *)out, (u32 *)in);
 }
+EXPORT_SYMBOL_GPL(crypto_sm4_encrypt);
 
 /* decrypt a block of text */
 
-static void sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
 
sm4_do_crypt(ctx->rkey_dec, (u32 *)out, (u32 *)in);
 }
+EXPORT_SYMBOL_GPL(crypto_sm4_decrypt);
 
 static struct crypto_alg sm4_alg = {
.cra_name   =   "sm4",
@@ -219,8 +221,8 @@ static struct crypto_alg sm4_alg = {
.cia_min_keysize=   SM4_KEY_SIZE,
.cia_max_keysize=   SM4_KEY_SIZE,
.cia_setkey =   crypto_sm4_set_key,
-   .cia_encrypt=   sm4_encrypt,
-   .cia_decrypt=   sm4_decrypt
+   .cia_encrypt=   crypto_sm4_encrypt,
+   .cia_decrypt=   crypto_sm4_decrypt
}
}
 };
diff --git a/include/crypto/sm4.h b/include/crypto/sm4.h
index b64e64d20b28..7afd730d16ff 100644
--- a/include/crypto/sm4.h
+++ b/include/crypto/sm4.h
@@ -25,4 +25,7 @@ int crypto_sm4_set_key(struct crypto_tfm *tfm, const u8 
*in_key,
 int crypto_sm4_expand_key(struct crypto_sm4_ctx *ctx, const u8 *in_key,
  unsigned int key_len);
 
+void crypto_sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
+void crypto_sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
+
 #endif
-- 
2.17.0



[PATCH 2/2] crypto: arm64 - add support for SM4 encryption using special instructions

2018-04-25 Thread Ard Biesheuvel
Add support for the SM4 symmetric cipher implemented using the special
SM4 instructions introduced in ARM architecture revision 8.2.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig   |  6 ++
 arch/arm64/crypto/Makefile  |  3 +
 arch/arm64/crypto/sm4-ce-core.S | 36 ++
 arch/arm64/crypto/sm4-ce-glue.c | 73 
 4 files changed, 118 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index cb5a243110c4..e3fdb0fd6f70 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -47,6 +47,12 @@ config CRYPTO_SM3_ARM64_CE
select CRYPTO_HASH
select CRYPTO_SM3
 
+config CRYPTO_SM4_ARM64_CE
+   tristate "SM4 symmetric cipher (ARMv8.2 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_ALGAPI
+   select CRYPTO_SM4
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index f35ac684b1c0..bcafd016618e 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -23,6 +23,9 @@ sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
 obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
 sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE) += sm4-ce.o
+sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
new file mode 100644
index ..af3bfbc3f4d4
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+
+   .irpb, 0, 1, 2, 3, 4, 5, 6, 7, 8
+   .set.Lv\b\().4s, \b
+   .endr
+
+   .macro  sm4e, rd, rn
+   .inst   0xcec08400 | .L\rd | (.L\rn << 5)
+   .endm
+
+   /*
+* void sm4_ce_do_crypt(const u32 *rk, u32 *out, const u32 *in);
+*/
+   .text
+ENTRY(sm4_ce_do_crypt)
+   ld1 {v8.4s}, [x2]
+   ld1 {v0.4s-v3.4s}, [x0], #64
+CPU_LE(rev32   v8.16b, v8.16b  )
+   ld1 {v4.4s-v7.4s}, [x0]
+   sm4ev8.4s, v0.4s
+   sm4ev8.4s, v1.4s
+   sm4ev8.4s, v2.4s
+   sm4ev8.4s, v3.4s
+   sm4ev8.4s, v4.4s
+   sm4ev8.4s, v5.4s
+   sm4ev8.4s, v6.4s
+   sm4ev8.4s, v7.4s
+   rev64   v8.4s, v8.4s
+   ext v8.16b, v8.16b, v8.16b, #8
+CPU_LE(rev32   v8.16b, v8.16b  )
+   st1 {v8.4s}, [x1]
+   ret
+ENDPROC(sm4_ce_do_crypt)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
new file mode 100644
index ..b7fb5274b250
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("sm4-ce");
+MODULE_DESCRIPTION("SM4 symmetric cipher using ARMv8 Crypto Extensions");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheu...@linaro.org>");
+MODULE_LICENSE("GPL v2");
+
+asmlinkage void sm4_ce_do_crypt(const u32 *rk, void *out, const void *in);
+
+static void sm4_ce_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+   const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   if (!may_use_simd()) {
+   crypto_sm4_encrypt(tfm, out, in);
+   } else {
+   kernel_neon_begin();
+   sm4_ce_do_crypt(ctx->rkey_enc, out, in);
+   kernel_neon_end();
+   }
+}
+
+static void sm4_ce_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+   const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   if (!may_use_simd()) {
+   crypto_sm4_decrypt(tfm, out, in);
+   } else {
+   kernel_neon_begin();
+   sm4_ce_do_crypt(ctx->rkey_dec, out, in);
+   kernel_neon_end();
+   }
+}
+
+static struct crypto_alg sm4_ce_alg = {
+   .cra_name   = "sm4",
+   .cra_driver_name= "sm4-ce",
+   .cra_priority   = 200,
+   .cra_flags  = CRYPTO_ALG_TYPE_CIPHER,
+   .cra_blocksize  = SM4_BLOCK_SIZE,
+   .cra_ctxsize= sizeof(struct crypto_sm4_ctx),
+   .cra_module = THIS_MODULE,
+   .cra_u.cipher = {
+   .cia_min_keysize= SM4_KEY_SIZE,
+   .cia_max_keysize= SM4_KEY_SIZE,
+   .cia_setkey   

[PATCH 0/2] crypto: implement SM4 for arm64 using special instructions

2018-04-25 Thread Ard Biesheuvel
Patch #1 makes some preparatory changes so the C routines can be used as
a fallback by other drivers.

Patch #2 implements the SM4 core cipher using the special instructions
introduced as an optional extension by revision 8.2 of the ARM architecture.

Note that this does not implement cipher+chaining mode combinations as we
do for AES. This can be added later if desired.

Ard Biesheuvel (2):
  crypto: sm4 - export encrypt/decrypt routines to other drivers
  crypto: arm64 - add support for SM4 encryption using special
instructions

 arch/arm64/crypto/Kconfig   |  6 ++
 arch/arm64/crypto/Makefile  |  3 +
 arch/arm64/crypto/sm4-ce-core.S | 36 ++
 arch/arm64/crypto/sm4-ce-glue.c | 73 
 crypto/sm4_generic.c| 10 +--
 include/crypto/sm4.h|  3 +
 6 files changed, 127 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/crypto/sm4-ce-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-glue.c

-- 
2.17.0



Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-03-19 Thread Ard Biesheuvel
On 16 March 2018 at 23:57, Herbert Xu <herb...@gondor.apana.org.au> wrote:
> On Sat, Mar 10, 2018 at 03:21:45PM +0000, Ard Biesheuvel wrote:
>> As reported by Sebastian, the way the arm64 NEON crypto code currently
>> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
>> causing problems with RT builds, given that the skcipher walk API may
>> allocate and free temporary buffers it uses to present the input and
>> output arrays to the crypto algorithm in blocksize sized chunks (where
>> blocksize is the natural blocksize of the crypto algorithm), and doing
>> so with NEON enabled means we're alloc/free'ing memory with preemption
>> disabled.
>>
>> This was deliberate: when this code was introduced, each kernel_neon_begin()
>> and kernel_neon_end() call incurred a fixed penalty of storing resp.
>> loading the contents of all NEON registers to/from memory, and so doing
>> it less often had an obvious performance benefit. However, in the mean time,
>> we have refactored the core kernel mode NEON code, and now 
>> kernel_neon_begin()
>> only incurs this penalty the first time it is called after entering the 
>> kernel,
>> and the NEON register restore is deferred until returning to userland. This
>> means pulling those calls into the loops that iterate over the input/output
>> of the crypto algorithm is not a big deal anymore (although there are some
>> places in the code where we relied on the NEON registers retaining their
>> values between calls)
>>
>> So let's clean this up for arm64: update the NEON based skcipher drivers to
>> no longer keep the NEON enabled when calling into the skcipher walk API.
>>
>> As pointed out by Peter, this only solves part of the problem. So let's
>> tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
>> flag each time after processing a fixed chunk of input.
>>
>> Given that this issue was flagged by the RT people, I would appreciate it
>> if they could confirm whether they are happy with this approach.
>>
>> Changes since v4:
>> - rebase onto v4.16-rc3
>> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
>>   in v4.16-rc1
>
> Looks good to me.  If more work is needed we can always do
> incremental fixes.
>
> Patches 1-22 applied.  Thanks.

Thanks Herbert.

Apologies if this wasn't clear, but there are some cross-dependencies
with the arm64 tree: patches 10 and 11 make non-trivial modifications
there, and patches 12-23 depend on them.

Without acks from them, we should really not be merging this code yet,
especially because I noticed a rebase issue in patch #10 (my bad).

Would you mind reverting 10 - 22? I will revisit this asap, and try to
get acks for the arm64 patches. If that means waiting for the next
cycle, so be it.

Thanks,
Ard.
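
(For context, the refactoring described in the quoted cover letter, i.e.
calling kernel_neon_begin()/kernel_neon_end() around each chunk inside the
skcipher walk loop rather than around the entire walk, follows roughly the
pattern below. This is a simplified sketch, not actual driver code;
aes_cbc_encrypt_asm is a made-up stand-in for whatever per-driver asm
helper is being wrapped.)

#include <asm/neon.h>
#include <crypto/aes.h>
#include <crypto/internal/skcipher.h>
#include <linux/linkage.h>

/* hypothetical asm helper; real drivers declare their own */
asmlinkage void aes_cbc_encrypt_asm(u8 out[], u8 const in[], u32 const rk[],
                                    int rounds, int blocks, u8 iv[]);

static int cbc_encrypt(struct skcipher_request *req)
{
        struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
        struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
        struct skcipher_walk walk;
        int err;

        err = skcipher_walk_virt(&walk, req, false);

        while (walk.nbytes >= AES_BLOCK_SIZE) {
                unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;

                /* enable NEON only around the asm call ... */
                kernel_neon_begin();
                aes_cbc_encrypt_asm(walk.dst.virt.addr, walk.src.virt.addr,
                                    ctx->key_enc, 6 + ctx->key_length / 4,
                                    blocks, walk.iv);
                /* ... and drop it before the walk API may alloc/free memory */
                kernel_neon_end();

                err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
        }
        return err;
}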


Re: [PATCH] crypto: arm,arm64 - Fix random regeneration of S_shipped

2018-03-14 Thread Ard Biesheuvel
On 14 March 2018 at 02:31, Masahiro Yamada
<yamada.masah...@socionext.com> wrote:
> 2018-03-14 5:17 GMT+09:00 Leonard Crestez <leonard.cres...@nxp.com>:
>> The decision to rebuild .S_shipped is made based on the relative
>> timestamps of .S_shipped and .pl files but git makes this essentially
>> random. This means that the perl script might run anyway (usually at
>> most once per checkout), defeating the whole purpose of _shipped.
>>
>> Fix by skipping the rule unless explicit make variables are provided:
>> REGENERATE_ARM_CRYPTO or REGENERATE_ARM64_CRYPTO.
>>
>> This can produce nasty occasional build failures downstream, for example
>> for toolchains with broken perl. The solution is minimally intrusive to
>> make it easier to push into stable.
>>
>> Another report on a similar issue here: https://lkml.org/lkml/2018/3/8/1379
>>
>> Signed-off-by: Leonard Crestez <leonard.cres...@nxp.com>
>> Cc: <sta...@vger.kernel.org>
>> ---
>
>
>
> Reviewed-by: Masahiro Yamada <yamada.masah...@socionext.com>
>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

>
>
>>  arch/arm/crypto/Makefile   | 2 ++
>>  arch/arm64/crypto/Makefile | 2 ++
>>  2 files changed, 4 insertions(+)
>>
>> Not clear if this needs to go through crypto or arm but all commits in these
>> directories start with "crypto:".
>>
>> My problems were only on arm64 because of a yocto toolchain which ships a 
>> version
>> of perl which fails on "use integer;".
>>
>> CC stable because this can cause trouble for downstream packagers.
>>
>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>> index 30ef8e2..c9919c2 100644
>> --- a/arch/arm/crypto/Makefile
>> +++ b/arch/arm/crypto/Makefile
>> @@ -47,20 +47,22 @@ sha256-arm-y:= sha256-core.o sha256_glue.o 
>> $(sha256-arm-neon-y)
>>  sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
>>  sha512-arm-y   := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
>>  sha1-arm-ce-y  := sha1-ce-core.o sha1-ce-glue.o
>>  sha2-arm-ce-y  := sha2-ce-core.o sha2-ce-glue.o
>>  aes-arm-ce-y   := aes-ce-core.o aes-ce-glue.o
>>  ghash-arm-ce-y := ghash-ce-core.o ghash-ce-glue.o
>>  crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>
>> +ifdef REGENERATE_ARM_CRYPTO
>>  quiet_cmd_perl = PERL$@
>>cmd_perl = $(PERL) $(<) > $(@)
>>
>>  $(src)/sha256-core.S_shipped: $(src)/sha256-armv4.pl
>> $(call cmd,perl)
>>
>>  $(src)/sha512-core.S_shipped: $(src)/sha512-armv4.pl
>> $(call cmd,perl)
>> +endif
>>
>>  .PRECIOUS: $(obj)/sha256-core.S $(obj)/sha512-core.S
>> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
>> index cee9b8d9..dfe651b 100644
>> --- a/arch/arm64/crypto/Makefile
>> +++ b/arch/arm64/crypto/Makefile
>> @@ -60,20 +60,22 @@ obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
>>  aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
>>
>>  AFLAGS_aes-ce.o:= -DINTERLEAVE=4
>>  AFLAGS_aes-neon.o  := -DINTERLEAVE=4
>>
>>  CFLAGS_aes-glue-ce.o   := -DUSE_V8_CRYPTO_EXTENSIONS
>>
>>  $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
>> $(call if_changed_rule,cc_o_c)
>>
>> +ifdef REGENERATE_ARM64_CRYPTO
>>  quiet_cmd_perlasm = PERLASM $@
>>cmd_perlasm = $(PERL) $(<) void $(@)
>>
>>  $(src)/sha256-core.S_shipped: $(src)/sha512-armv8.pl
>> $(call cmd,perlasm)
>>
>>  $(src)/sha512-core.S_shipped: $(src)/sha512-armv8.pl
>> $(call cmd,perlasm)
>> +endif
>>
>>  .PRECIOUS: $(obj)/sha256-core.S $(obj)/sha512-core.S
>> --
>> 2.7.4
>>
>
>
>
> --
> Best Regards
> Masahiro Yamada


Re: what is a replacement for private_AES_set_encrypt_key and AES_encrypt functions

2018-03-12 Thread Ard Biesheuvel
On 12 March 2018 at 14:38, Vitaly Andrianov  wrote:
> Hello,
>
> The Texas Instruments keystone2 out-of-tree kernel uses the
> private_AES_set_encrypt_key() and
> the AES_encrypt() at the crypto HW acceleration driver.
>
> The "crypto: arm/aes - replace bit-sliced OpenSSL NEON code" commit removed
> those functions.
> Here is a code, which calls the removed functions.
>
> static inline int sa_aes_xcbc_subkey(u8 *sub_key1, u8 *sub_key2,
>  u8 *sub_key3, const u8 *key,
>  u16 key_sz)
> {
> struct AES_KEY enc_key;
>
> if (private_AES_set_encrypt_key(key, (key_sz * 8), &enc_key)) {
> pr_err("%s: failed to set enc key\n", __func__);
> return -EINVAL;
> }
>
> if (sub_key1) {
> memset(sub_key1, 0x01, AES_BLOCK_SIZE);
> AES_encrypt(sub_key1, sub_key1, &enc_key);
> }
>
> if (sub_key2) {
> memset(sub_key2, 0x02, AES_BLOCK_SIZE);
> AES_encrypt(sub_key2, sub_key2, &enc_key);
> }
>
> if (sub_key3) {
> memset(sub_key3, 0x03, AES_BLOCK_SIZE);
> AES_encrypt(sub_key3, sub_key3, &enc_key);
> }
>
> return 0;
> }
>
> Which functions can I use to replace the removed ones in the above code?
>

Look at xcbc_setkey() in arch/arm64/crypto/aes-glue.c for an example
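
A rough sketch of how the same subkey derivation could be done against the
generic cipher API instead (illustrative only; error handling is minimal and
the helper name is simply carried over from the snippet above):

#include <crypto/aes.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/string.h>

static int sa_aes_xcbc_subkey(u8 *sub_key1, u8 *sub_key2, u8 *sub_key3,
                              const u8 *key, u16 key_sz)
{
        struct crypto_cipher *tfm;
        int ret;

        /* single-block AES transform; resolves to the best available driver */
        tfm = crypto_alloc_cipher("aes", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);

        ret = crypto_cipher_setkey(tfm, key, key_sz);
        if (ret)
                goto out;

        if (sub_key1) {
                memset(sub_key1, 0x01, AES_BLOCK_SIZE);
                crypto_cipher_encrypt_one(tfm, sub_key1, sub_key1);
        }
        if (sub_key2) {
                memset(sub_key2, 0x02, AES_BLOCK_SIZE);
                crypto_cipher_encrypt_one(tfm, sub_key2, sub_key2);
        }
        if (sub_key3) {
                memset(sub_key3, 0x03, AES_BLOCK_SIZE);
                crypto_cipher_encrypt_one(tfm, sub_key3, sub_key3);
        }
out:
        crypto_free_cipher(tfm);
        return ret;
}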


Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-03-11 Thread Ard Biesheuvel
On 11 March 2018 at 05:16, Vakul Garg <vakul.g...@nxp.com> wrote:
> Hi
>
> How does this patchset affect the throughput performance of crypto?
> Is it expected to increase?
>

This is about latency, not throughput. The throughput may decrease
slightly (<1%), but spikes in scheduling latency due to NEON-based
crypto should be a thing of the past.

Note that if you require maximum throughput without regard for
scheduling latency, you should disable CONFIG_PREEMPT in your kernel,
in which case these patches do absolutely nothing.

>> -Original Message-
>> From: linux-crypto-ow...@vger.kernel.org [mailto:linux-crypto-
>> ow...@vger.kernel.org] On Behalf Of Ard Biesheuvel
>> Sent: Saturday, March 10, 2018 8:52 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herb...@gondor.apana.org.au; linux-arm-ker...@lists.infradead.org;
>> Ard Biesheuvel <ard.biesheu...@linaro.org>; Dave Martin
>> <dave.mar...@arm.com>; Russell King - ARM Linux
>> <li...@armlinux.org.uk>; Sebastian Andrzej Siewior
>> <bige...@linutronix.de>; Mark Rutland <mark.rutl...@arm.com>; linux-rt-
>> us...@vger.kernel.org; Peter Zijlstra <pet...@infradead.org>; Catalin
>> Marinas <catalin.mari...@arm.com>; Will Deacon
>> <will.dea...@arm.com>; Steven Rostedt <rost...@goodmis.org>; Thomas
>> Gleixner <t...@linutronix.de>
>> Subject: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
>>
>> As reported by Sebastian, the way the arm64 NEON crypto code currently
>> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
>> causing problems with RT builds, given that the skcipher walk API may
>> allocate and free temporary buffers it uses to present the input and output
>> arrays to the crypto algorithm in blocksize sized chunks (where blocksize is
>> the natural blocksize of the crypto algorithm), and doing so with NEON
>> enabled means we're alloc/free'ing memory with preemption disabled.
>>
>> This was deliberate: when this code was introduced, each
>> kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty of
>> storing resp.
>> loading the contents of all NEON registers to/from memory, and so doing it
>> less often had an obvious performance benefit. However, in the mean time,
>> we have refactored the core kernel mode NEON code, and now
>> kernel_neon_begin() only incurs this penalty the first time it is called 
>> after
>> entering the kernel, and the NEON register restore is deferred until 
>> returning
>> to userland. This means pulling those calls into the loops that iterate over 
>> the
>> input/output of the crypto algorithm is not a big deal anymore (although
>> there are some places in the code where we relied on the NEON registers
>> retaining their values between calls)
>>
>> So let's clean this up for arm64: update the NEON based skcipher drivers to
>> no longer keep the NEON enabled when calling into the skcipher walk API.
>>
>> As pointed out by Peter, this only solves part of the problem. So let's 
>> tackle it
>> more thoroughly, and update the algorithms to test the NEED_RESCHED flag
>> each time after processing a fixed chunk of input.
>>
>> Given that this issue was flagged by the RT people, I would appreciate it if
>> they could confirm whether they are happy with this approach.
>>
>> Changes since v4:
>> - rebase onto v4.16-rc3
>> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
>>   in v4.16-rc1
>>
>> Changes since v3:
>> - incorporate Dave's feedback on the asm macros to push/pop frames and to
>> yield
>>   the NEON conditionally
>> - make frame_push/pop more easy to use, by recording the arguments to
>>   frame_push, removing the need to specify them again when calling
>> frame_pop
>> - emit local symbol .Lframe_local_offset to allow code using the frame
>> push/pop
>>   macros to index the stack more easily
>> - use the magic \@ macro invocation counter provided by GAS to generate
>> unique
>>   labels in the NEON yield macros, rather than relying on chance
>>
>> Changes since v2:
>> - Drop logic to yield only after so many blocks - as it turns out, the
>>   throughput of the algorithms that are most likely to be affected by the
>>   overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if
>> that
>>   is unacceptable, you are probably not using CONFIG_PREEMPT in the first
>>   place.
>> - Add yield support to the AES-CCM driver
>> - Clean up macros based on feedback from Dave
>> - Giv

[PATCH v5 08/23] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path

2018-03-10 Thread Ard Biesheuvel
CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC which yields every 16 bytes).

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
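
For illustration, a rough standalone C model of the unrolled loop (a sketch
only: aes_encrypt_block() and xor_block() are placeholders for the
encrypt_block macro and the eor instructions, and the final-block/finalize
handling of the real routine is omitted):

#include <stdint.h>

/* Placeholder for the one-block AES primitive (encrypt_block in the asm). */
void aes_encrypt_block(uint8_t blk[16], const void *key);

static void xor_block(uint8_t dst[16], const uint8_t *src)
{
        int i;

        for (i = 0; i < 16; i++)
                dst[i] ^= src[i];
}

/* CBC-MAC update, unrolled by 4: the AES invocations remain strictly
 * sequential, but 64 bytes of input are consumed per iteration, which is
 * what allows the asm to merge the loads into one ld1 {v1.16b-v4.16b}. */
static void cbc_mac_update(uint8_t mac[16], const uint8_t *in, int blocks,
                           const void *key)
{
        while (blocks >= 4) {
                xor_block(mac, in);      aes_encrypt_block(mac, key);
                xor_block(mac, in + 16); aes_encrypt_block(mac, key);
                xor_block(mac, in + 32); aes_encrypt_block(mac, key);
                xor_block(mac, in + 48); aes_encrypt_block(mac, key);
                in += 64;
                blocks -= 4;
        }
        while (blocks-- > 0) {          /* 1x tail, as in .Lmacloop */
                xor_block(mac, in);
                aes_encrypt_block(mac, key);
                in += 16;
        }
}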

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-modes.S | 23 ++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
 AES_ENTRY(aes_mac_update)
ld1 {v0.16b}, [x4]  /* get dg */
enc_prepare w2, x1, x7
-   cbnzw5, .Lmacenc
+   cbz w5, .Lmacloop4x
 
+   encrypt_block   v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+   subsw3, w3, #4
+   bmi .Lmac1x
+   ld1 {v1.16b-v4.16b}, [x0], #64  /* get next pt block */
+   eor v0.16b, v0.16b, v1.16b  /* ..and xor with dg */
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v2.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v3.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v4.16b
+   cmp w3, wzr
+   csinv   x5, x6, xzr, eq
+   cbz w5, .Lmacout
+   encrypt_block   v0, w2, x1, x7, w8
+   b   .Lmacloop4x
+.Lmac1x:
+   add w3, w3, #4
 .Lmacloop:
cbz w3, .Lmacout
ld1 {v1.16b}, [x0], #16 /* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
csinv   x5, x6, xzr, eq
cbz w5, .Lmacout
 
-.Lmacenc:
encrypt_block   v0, w2, x1, x7, w8
b   .Lmacloop
 
-- 
2.15.1



[PATCH v5 23/23] DO NOT MERGE

2018-03-10 Thread Ard Biesheuvel
Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
 arch/arm64/include/asm/assembler.h | 33 
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 61168cbe9781..b471b0bbdfe6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -678,6 +678,7 @@ alternative_else_nop_endif
cmp w1, #PREEMPT_DISABLE_OFFSET
cselx0, x0, xzr, eq
tbnzx0, #TIF_NEED_RESCHED, .Lyield_\@   // needs 
rescheduling?
+   b   .Lyield_\@
 #endif
/* fall through to endif_yield_neon */
.subsection 1
@@ -687,6 +688,38 @@ alternative_else_nop_endif
.macro  do_cond_yield_neon
bl  kernel_neon_end
bl  kernel_neon_begin
+   moviv0.16b, #0x55
+   moviv1.16b, #0x55
+   moviv2.16b, #0x55
+   moviv3.16b, #0x55
+   moviv4.16b, #0x55
+   moviv5.16b, #0x55
+   moviv6.16b, #0x55
+   moviv7.16b, #0x55
+   moviv8.16b, #0x55
+   moviv9.16b, #0x55
+   moviv10.16b, #0x55
+   moviv11.16b, #0x55
+   moviv12.16b, #0x55
+   moviv13.16b, #0x55
+   moviv14.16b, #0x55
+   moviv15.16b, #0x55
+   moviv16.16b, #0x55
+   moviv17.16b, #0x55
+   moviv18.16b, #0x55
+   moviv19.16b, #0x55
+   moviv20.16b, #0x55
+   moviv21.16b, #0x55
+   moviv22.16b, #0x55
+   moviv23.16b, #0x55
+   moviv24.16b, #0x55
+   moviv25.16b, #0x55
+   moviv26.16b, #0x55
+   moviv27.16b, #0x55
+   moviv28.16b, #0x55
+   moviv29.16b, #0x55
+   moviv30.16b, #0x55
+   moviv31.16b, #0x55
.endm
 
.macro  endif_yield_neon, lbl
-- 
2.15.1



[PATCH v5 11/23] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT

2018-03-10 Thread Ard Biesheuvel
Add support macros to conditionally yield the NEON (and thus the CPU)
that may be called from the assembler code.

In some cases, yielding the NEON involves saving and restoring a non-trivial
amount of context (especially in the CRC folding algorithms),
and so the macro is split into three, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The third macro takes an optional label argument that marks the resume
path after a yield has been performed.
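
Expressed in C, the condition tested by if_will_cond_yield_neon amounts to
the sketch below (the two constants are placeholders standing in for the
real asm-offsets/thread-info values, not kernel definitions):

#include <stdbool.h>

#define PREEMPT_DISABLE_OFFSET  1       /* placeholder value */
#define TIF_NEED_RESCHED        1       /* placeholder bit number */

/* Yield only if (a) dropping the single preempt_disable() taken by
 * kernel_neon_begin() would actually make the task preemptible, and
 * (b) a reschedule is pending; otherwise yielding is pointless. */
static bool should_yield_neon(unsigned int preempt_count,
                              unsigned long thread_flags)
{
        if (preempt_count != PREEMPT_DISABLE_OFFSET)
                return false;
        return thread_flags & (1UL << TIF_NEED_RESCHED);
}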

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/include/asm/assembler.h | 64 
 arch/arm64/kernel/asm-offsets.c|  2 +
 2 files changed, 66 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index eef1fd2c1c0b..61168cbe9781 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -635,4 +635,68 @@ alternative_else_nop_endif
.endif
.endm
 
+/*
+ * Check whether to yield to another runnable task from kernel mode NEON code
+ * (which runs with preemption disabled).
+ *
+ * if_will_cond_yield_neon
+ *// pre-yield patchup code
+ * do_cond_yield_neon
+ *// post-yield patchup code
+ * endif_yield_neon    <label>
+ *
+ * where <label> is optional, and marks the point where execution will resume
+ * after a yield has been performed. If omitted, execution resumes right after
+ * the endif_yield_neon invocation.
+ *
+ * Note that the patchup code does not support assembler directives that change
+ * the output section, any use of such directives is undefined.
+ *
+ * The yield itself consists of the following:
+ * - Check whether the preempt count is exactly 1, in which case disabling
+ *   preemption once will make the task preemptible. If this is not the case,
+ *   yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ *   kernel mode NEON (which will trigger a reschedule), and branch to the
+ *   yield fixup code.
+ *
+ * This macro sequence clobbers x0, x1 and the flags register unconditionally,
+ * and may clobber x2 .. x18 if the yield path is taken.
+ */
+
+   .macro  cond_yield_neon, lbl
+   if_will_cond_yield_neon
+   do_cond_yield_neon
+   endif_yield_neon\lbl
+   .endm
+
+   .macro  if_will_cond_yield_neon
+#ifdef CONFIG_PREEMPT
+   get_thread_info x0
+   ldr w1, [x0, #TSK_TI_PREEMPT]
+   ldr x0, [x0, #TSK_TI_FLAGS]
+   cmp w1, #PREEMPT_DISABLE_OFFSET
+   cselx0, x0, xzr, eq
+   tbnzx0, #TIF_NEED_RESCHED, .Lyield_\@   // needs 
rescheduling?
+#endif
+   /* fall through to endif_yield_neon */
+   .subsection 1
+.Lyield_\@ :
+   .endm
+
+   .macro  do_cond_yield_neon
+   bl  kernel_neon_end
+   bl  kernel_neon_begin
+   .endm
+
+   .macro  endif_yield_neon, lbl
+   .ifnb   \lbl
+   b   \lbl
+   .else
+   b   .Lyield_out_\@
+   .endif
+   .previous
+.Lyield_out_\@ :
+   .endm
+
 #endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 1303e04110cd..1e2ea2e51acb 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -93,6 +93,8 @@ int main(void)
   DEFINE(DMA_TO_DEVICE,DMA_TO_DEVICE);
   DEFINE(DMA_FROM_DEVICE,  DMA_FROM_DEVICE);
   BLANK();
+  DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
+  BLANK();
   DEFINE(CLOCK_REALTIME,   CLOCK_REALTIME);
   DEFINE(CLOCK_MONOTONIC,  CLOCK_MONOTONIC);
   DEFINE(CLOCK_MONOTONIC_RAW,  CLOCK_MONOTONIC_RAW);
-- 
2.15.1



[PATCH v5 14/23] crypto: arm64/aes-ccm - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
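
The per-algorithm patches in the series all follow the same store/yield/
reload shape; a rough standalone C model of that shape (the helpers are
stubs for this sketch, not kernel APIs):

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 16

struct alg_state { uint8_t bytes[64]; };

/* Stubs standing in for the real primitives -- illustrative only. */
static void load_state(const struct alg_state *st) { (void)st; }
static void save_state(struct alg_state *st) { (void)st; }
static void process_one_block(struct alg_state *st, const uint8_t *in)
{ (void)st; (void)in; }
static bool reschedule_pending(void) { return false; }
static void yield_neon(void) { /* kernel_neon_end(); kernel_neon_begin(); */ }

static void transform(struct alg_state *st, const uint8_t *in, int blocks)
{
        while (blocks > 0) {
                load_state(st);                 /* ld1 of the state registers */
                do {
                        process_one_block(st, in);  /* main loop, NEON held   */
                        in += BLOCK_SIZE;
                } while (--blocks > 0 && !reschedule_pending());

                save_state(st);                 /* st1 before giving up NEON  */
                if (blocks > 0)
                        yield_neon();           /* do_cond_yield_neon         */
        }
}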

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +---
 1 file changed, 95 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S 
b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..88f5aef7934c 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   ldr w8, [x3]/* leftover from prev round? */
+   frame_push  7
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+
+   ldr w25, [x22]  /* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w8, 1f
-   sub w8, w8, #16
+   cbz w25, 1f
+   sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrbw7, [x1], #1/* get 1 byte of input */
-   subsw2, w2, #1
-   add w8, w8, #1
+0: ldrbw7, [x20], #1   /* get 1 byte of input */
+   subsw21, w21, #1
+   add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw8, 0b
+   cbnzw25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4]   /* load first round key */
-   prfmpldl1strm, [x1]
-   cmp w5, #12 /* which key size? */
-   add x6, x4, #16
-   sub w7, w5, #2  /* modified # of rounds */
+1: ld1 {v3.4s}, [x23]  /* load first round key */
+   prfmpldl1strm, [x20]
+   cmp w24, #12/* which key size? */
+   add x6, x23, #16
+   sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aesev0.16b, v4.16b
-   subsw2, w2, #16 /* last data? */
+   subsw21, w21, #16   /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x1], #16 /* load next input block */
+   ld1 {v1.16b}, [x20], #16/* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   bne 1b
-6: st1 {v0.16b}, [x0]  /* store mac */
+   beq 6f
+
+   if_will_cond_yield_neon
+   st1 {v0.16b}, [x19] /* store mac */
+   do_cond_yield_neon
+   ld1 {v0.16b}, [x19] /* reload mac */
+   endif_yield_neon
+
+   b   1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
-   addsw2, w2, #16
+   addsw21, w21, #16
beq 10f
-   mov w8, w2
-7: ldrbw7, [x1], #1
+   mov w25, w21
+7: ldrbw7, [x20], #1
umovw6, v0.b[0]
eor w6, w6, w7
-   strbw6, [x0], #1
-   subsw2, w2, #1
+   strbw6, [x19], #1
+   subsw21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1  /* rotate out the mac bytes */
b   7b
-8: mov w7, w8
-   add w8, w8, #16
+8: mov w7, w25
+   add w25, w25, #16
 9: ext v1.16b, v1.16b, v1.16b, #1
addsw7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
-   st1 {v0.16b}, [x0]
-10:str w8, [x3]
+   st1 {v0.16b}, [x19]
+10:str w25, [x22]
+
+   frame_pop
ret
 ENDPROC(ce_aes_ccm_auth_data)
 
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
 ENDPROC(ce_aes_ccm_final)
 
.macro  aes_ccm_do_crypt,enc
-   ldr x8, [x6, #8]/* load lower ctr */
-   ld1 {v0.16b}, [x5]  /* load mac */
-CPU_LE(rev x8, x8  )   /* keep swabbed ctr in 
reg */
+   frame_push  8
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+
+   ldr x26, [x25, #8]  /* load lower ctr */
+   ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE(rev x26, x26)   /* keep swabb

[PATCH v5 20/23] crypto: arm64/sha3-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha3-ce-core.S | 77 +---
 1 file changed, 50 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
index 332ad7530690..a7d587fa54f6 100644
--- a/arch/arm64/crypto/sha3-ce-core.S
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -41,9 +41,16 @@
 */
.text
 ENTRY(sha3_ce_transform)
-   /* load state */
-   add x8, x0, #32
-   ld1 { v0.1d- v3.1d}, [x0]
+   frame_push  4
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+
+0: /* load state */
+   add x8, x19, #32
+   ld1 { v0.1d- v3.1d}, [x19]
ld1 { v4.1d- v7.1d}, [x8], #32
ld1 { v8.1d-v11.1d}, [x8], #32
ld1 {v12.1d-v15.1d}, [x8], #32
@@ -51,13 +58,13 @@ ENTRY(sha3_ce_transform)
ld1 {v20.1d-v23.1d}, [x8], #32
ld1 {v24.1d}, [x8]
 
-0: sub w2, w2, #1
+1: sub w21, w21, #1
mov w8, #24
adr_l   x9, .Lsha3_rcon
 
/* load input */
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v31.8b}, [x1], #24
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v31.8b}, [x20], #24
eor v0.8b, v0.8b, v25.8b
eor v1.8b, v1.8b, v26.8b
eor v2.8b, v2.8b, v27.8b
@@ -66,10 +73,10 @@ ENTRY(sha3_ce_transform)
eor v5.8b, v5.8b, v30.8b
eor v6.8b, v6.8b, v31.8b
 
-   tbnzx3, #6, 2f  // SHA3-512
+   tbnzx22, #6, 3f // SHA3-512
 
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b-v30.8b}, [x1], #16
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b-v30.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
eor  v9.8b,  v9.8b, v27.8b
@@ -77,34 +84,34 @@ ENTRY(sha3_ce_transform)
eor v11.8b, v11.8b, v29.8b
eor v12.8b, v12.8b, v30.8b
 
-   tbnzx3, #4, 1f  // SHA3-384 or SHA3-224
+   tbnzx22, #4, 2f // SHA3-384 or SHA3-224
 
// SHA3-256
-   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v25.8b-v28.8b}, [x20], #32
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
-   b   3f
+   b   4f
 
-1: tbz x3, #2, 3f  // bit 2 cleared? SHA-384
+2: tbz x22, #2, 4f // bit 2 cleared? SHA-384
 
// SHA3-224
-   ld1 {v25.8b-v28.8b}, [x1], #32
-   ld1 {v29.8b}, [x1], #8
+   ld1 {v25.8b-v28.8b}, [x20], #32
+   ld1 {v29.8b}, [x20], #8
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
eor v17.8b, v17.8b, v29.8b
-   b   3f
+   b   4f
 
// SHA3-512
-2: ld1 {v25.8b-v26.8b}, [x1], #16
+3: ld1 {v25.8b-v26.8b}, [x20], #16
eor  v7.8b,  v7.8b, v25.8b
eor  v8.8b,  v8.8b, v26.8b
 
-3: sub w8, w8, #1
+4: sub w8, w8, #1
 
eor3v29.16b,  v4.16b,  v9.16b, v14.16b
eor3v26.16b,  v1.16b,  v6.16b, v11.16b
@@ -183,17 +190,33 @@ ENTRY(sha3_ce_transform)
 
eor  v0.16b,  v0.16b, v31.16b
 
-   cbnzw8, 3b
-   cbnzw2, 0b
+   cbnzw8, 4b
+   cbz w21, 5f
+
+   if_will_cond_yield_neon
+   add x8, x19, #32
+   st1 { v0.1d- v3.1d}, [x19]
+   st1 { v4.1d- v7.1d}, [x8], #32
+   st1 { v8.1d-v11.1d}, [x8], #32
+   st1 {v12.1d-v15.1d}, [x8], #32
+   st1 {v16.1d-v19.1d}, [x8], #32
+   st1 {v20.1d-v23.1d}, [x8], #32
+   st1 {v24.1d}, [x8]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* save state */
-   st1 { v0.1d- v3.1d}, [x0], #32
-   st1 { v4.1d- v7.1d}, [x0], #32
-   st1 { v8.1d-v11.1d}, [x0], #32
-   st1 {v12.1d-v15.1d}, [x0], #32
-   st1 {v16.1d-v19.1d}, [x0], #32
-   st1 {v20.1d-v23.1d}, [x0], #32
-   st1 {v24.1d}, [x0]
+5: st1 { v0.1d- v3.1d}, [x19], #32
+   st1 { v4.1d- v7.1d}, [x19], #32
+   st1 { v8.1d-v11.1d}, [x19], #32
+   st1 {v12.1d-v15.1d}, [x19], #32
+   st1 {v16.1d-v19.1d}, [x19], #32
+   st1 {v20.1d-v23.1d}, [x19], #32
+   st1 {v24.1d}, [x19]
+   frame_pop
ret
 ENDPROC(sha3_ce_transform)
 
-- 
2.15.1



[PATCH v5 22/23] crypto: arm64/sm3-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sm3-ce-core.S | 30 +++-
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/crypto/sm3-ce-core.S b/arch/arm64/crypto/sm3-ce-core.S
index 27169fe07a68..5a116c8d0cee 100644
--- a/arch/arm64/crypto/sm3-ce-core.S
+++ b/arch/arm64/crypto/sm3-ce-core.S
@@ -77,19 +77,25 @@
 */
.text
 ENTRY(sm3_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load state */
-   ld1 {v8.4s-v9.4s}, [x0]
+   ld1 {v8.4s-v9.4s}, [x19]
rev64   v8.4s, v8.4s
rev64   v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
 
-   adr_l   x8, .Lt
+0: adr_l   x8, .Lt
ldp s13, s14, [x8]
 
/* load input */
-0: ld1 {v0.16b-v3.16b}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v0.16b-v3.16b}, [x20], #64
+   sub w21, w21, #1
 
mov v15.16b, v8.16b
mov v16.16b, v9.16b
@@ -125,14 +131,24 @@ CPU_LE(   rev32   v3.16b, v3.16b  )
eor v9.16b, v9.16b, v16.16b
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 2f
+
+   if_will_cond_yield_neon
+   st1 {v8.4s-v9.4s}, [x19]
+   do_cond_yield_neon
+   ld1 {v8.4s-v9.4s}, [x19]
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* save state */
-   rev64   v8.4s, v8.4s
+2: rev64   v8.4s, v8.4s
rev64   v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
-   st1 {v8.4s-v9.4s}, [x0]
+   st1 {v8.4s-v9.4s}, [x19]
+   frame_pop
ret
 ENDPROC(sm3_ce_transform)
 
-- 
2.15.1



[PATCH v5 13/23] crypto: arm64/sha2-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha2-ce-core.S | 37 ++--
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 4c3c89b812ce..cd8b36412469 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -79,30 +79,36 @@
 */
.text
 ENTRY(sha2_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   adr_l   x8, .Lsha2_rcon
+0: adr_l   x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
 
/* load state */
-   ld1 {dgav.4s, dgbv.4s}, [x0]
+   ld1 {dgav.4s, dgbv.4s}, [x19]
 
/* load sha256_ce_state::finalize */
ldr_l   w4, sha256_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v16.16b, v16.16b)
 CPU_LE(rev32   v17.16b, v17.16b)
 CPU_LE(rev32   v18.16b, v18.16b)
 CPU_LE(rev32   v19.16b, v19.16b)
 
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
 
@@ -131,16 +137,24 @@ CPU_LE(   rev32   v19.16b, v19.16b)
add dgbv.4s, dgbv.4s, dg1v.4s
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s, dgbv.4s}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha256_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv17.2d, #0
mov x8, #0x8000
moviv18.2d, #0
@@ -149,9 +163,10 @@ CPU_LE(rev32   v19.16b, v19.16b)
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+   frame_pop
ret
 ENDPROC(sha2_ce_transform)
-- 
2.15.1



[PATCH v5 09/23] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels

2018-03-10 Thread Ard Biesheuvel
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.

Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.
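
As a standalone illustration of the chunking arithmetic in the hunk below
(with the IS_ENABLED(CONFIG_PREEMPT) guard dropped), a 100-byte update that
arrives with 48 bytes already buffered is fed to the NEON code in chunks of
16, 64 and 20 bytes:

#include <stdio.h>

#define SHA256_BLOCK_SIZE 64

int main(void)
{
        unsigned int count = 48;        /* bytes already buffered (sctx->count) */
        unsigned int len = 100;         /* size of this update */

        while (len > 0) {
                unsigned int chunk = len;

                if (chunk + count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
                        chunk = SHA256_BLOCK_SIZE -
                                count % SHA256_BLOCK_SIZE;

                printf("chunk = %u\n", chunk);  /* 16, then 64, then 20 */
                count += chunk;
                len -= chunk;
        }
        return 0;
}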

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha256-glue.c | 36 +---
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
 static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
  unsigned int len)
 {
-   /*
-* Stacking and unstacking a substantial slice of the NEON register
-* file may significantly affect performance for small updates when
-* executing in interrupt context, so fall back to the scalar code
-* in that case.
-*/
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return sha256_base_do_update(desc, data, len,
(sha256_block_fn *)sha256_block_data_order);
 
-   kernel_neon_begin();
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
-   kernel_neon_end();
+   while (len > 0) {
+   unsigned int chunk = len;
+
+   /*
+* Don't hog the CPU for the entire time it takes to process all
+* input when running on a preemptible kernel, but process the
+* data block by block instead.
+*/
+   if (IS_ENABLED(CONFIG_PREEMPT) &&
+   chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+   chunk = SHA256_BLOCK_SIZE -
+   sctx->count % SHA256_BLOCK_SIZE;
 
+   kernel_neon_begin();
+   sha256_base_do_update(desc, data, chunk,
+ (sha256_block_fn *)sha256_block_neon);
+   kernel_neon_end();
+   data += chunk;
+   len -= chunk;
+   }
return 0;
 }
 
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, 
const u8 *data,
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_data_order);
} else {
-   kernel_neon_begin();
if (len)
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
+   sha256_update_neon(desc, data, len);
+   kernel_neon_begin();
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_neon);
kernel_neon_end();
-- 
2.15.1



[PATCH v5 18/23] crypto: arm64/crc32-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crc32-ce-core.S | 40 +++-
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S 
b/arch/arm64/crypto/crc32-ce-core.S
index 16ed3c7ebd37..8061bf0f9c66 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,10 @@
dCONSTANT   .reqd0
qCONSTANT   .reqq0
 
-   BUF .reqx0
-   LEN .reqx1
-   CRC .reqx2
+   BUF .reqx19
+   LEN .reqx20
+   CRC .reqx21
+   CONST   .reqx22
 
vzr .reqv9
 
@@ -123,7 +124,14 @@ ENTRY(crc32_pmull_le)
 ENTRY(crc32c_pmull_le)
adr_l   x3, .Lcrc32c_constants
 
-0: bic LEN, LEN, #15
+0: frame_push  4, 64
+
+   mov BUF, x0
+   mov LEN, x1
+   mov CRC, x2
+   mov CONST, x3
+
+   bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movivzr.16b, #0
fmovdCONSTANT, CRC
@@ -132,7 +140,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.ltless_64
 
-   ldr qCONSTANT, [x3]
+   ldr qCONSTANT, [CONST]
 
 loop_64:   /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -162,10 +170,21 @@ loop_64:  /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
 
cmp LEN, #0x40
-   b.geloop_64
+   b.ltless_64
+
+   if_will_cond_yield_neon
+   stp q1, q2, [sp, #.Lframe_local_offset]
+   stp q3, q4, [sp, #.Lframe_local_offset + 32]
+   do_cond_yield_neon
+   ldp q1, q2, [sp, #.Lframe_local_offset]
+   ldp q3, q4, [sp, #.Lframe_local_offset + 32]
+   ldr qCONSTANT, [CONST]
+   movivzr.16b, #0
+   endif_yield_neon
+   b   loop_64
 
 less_64:   /* Folding cache line into 128bit */
-   ldr qCONSTANT, [x3, #16]
+   ldr qCONSTANT, [CONST, #16]
 
pmull2  v5.1q, v1.2d, vCONSTANT.2d
pmull   v1.1q, v1.1d, vCONSTANT.1d
@@ -204,8 +223,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* final 32-bit fold */
-   ldr dCONSTANT, [x3, #32]
-   ldr d3, [x3, #40]
+   ldr dCONSTANT, [CONST, #32]
+   ldr d3, [CONST, #40]
 
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -213,7 +232,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
-   ldr qCONSTANT, [x3, #48]
+   ldr qCONSTANT, [CONST, #48]
 
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -223,6 +242,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
 
+   frame_pop
ret
 ENDPROC(crc32_pmull_le)
 ENDPROC(crc32c_pmull_le)
-- 
2.15.1



[PATCH v5 16/23] crypto: arm64/aes-bs - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-core.S | 305 +++-
 1 file changed, 170 insertions(+), 135 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S 
b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..e613a87f8b53 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
 *   int blocks)
 */
.macro  __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
 99:mov x5, #1
-   lsl x5, x5, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x5, x5, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx5, x5, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
tbnzx5, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
tbnzx5, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
tbnzx5, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
tbnzx5, #4, 0f
-   ld1 {v4.16b}, [x1], #16
+   ld1 {v4.16b}, [x20], #16
tbnzx5, #5, 0f
-   ld1 {v5.16b}, [x1], #16
+   ld1 {v5.16b}, [x20], #16
tbnzx5, #6, 0f
-   ld1 {v6.16b}, [x1], #16
+   ld1 {v6.16b}, [x20], #16
tbnzx5, #7, 0f
-   ld1 {v7.16b}, [x1], #16
+   ld1 {v7.16b}, [x20], #16
 
-0: mov bskey, x2
-   mov rounds, x3
+0: mov bskey, x21
+   mov rounds, x22
bl  \do8
 
-   st1 {\o0\().16b}, [x0], #16
+   st1 {\o0\().16b}, [x19], #16
tbnzx5, #1, 1f
-   st1 {\o1\().16b}, [x0], #16
+   st1 {\o1\().16b}, [x19], #16
tbnzx5, #2, 1f
-   st1 {\o2\().16b}, [x0], #16
+   st1 {\o2\().16b}, [x19], #16
tbnzx5, #3, 1f
-   st1 {\o3\().16b}, [x0], #16
+   st1 {\o3\().16b}, [x19], #16
tbnzx5, #4, 1f
-   st1 {\o4\().16b}, [x0], #16
+   st1 {\o4\().16b}, [x19], #16
tbnzx5, #5, 1f
-   st1 {\o5\().16b}, [x0], #16
+   st1 {\o5\().16b}, [x19], #16
tbnzx5, #6, 1f
-   st1 {\o6\().16b}, [x0], #16
+   st1 {\o6\().16b}, [x19], #16
tbnzx5, #7, 1f
-   st1 {\o7\().16b}, [x0], #16
+   st1 {\o7\().16b}, [x19], #16
 
-   cbnzx4, 99b
+   cbz x23, 1f
+   cond_yield_neon
+   b   99b
 
-1: ldp x29, x30, [sp], #16
+1: frame_pop
ret
.endm
 
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
 */
.align  4
 ENTRY(aesbs_cbc_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  6
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
 
 99:mov x6, #1
-   lsl x6, x6, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x6, x6, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx6, x6, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnzx6, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnzx6, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnzx6, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnzx6, #4, 0f
-

[PATCH v5 15/23] crypto: arm64/aes-blk - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-ce.S|  15 +-
 arch/arm64/crypto/aes-modes.S | 331 
 2 files changed, 216 insertions(+), 130 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
 
/* prepare for encryption with key in rk[] */
-   .macro  enc_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for encryption (again) but with new key in rk[] */
-   .macro  enc_switch_key, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_switch_key, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for decryption with key in rk[] */
-   .macro  dec_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  dec_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
.macro  do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..483a7130cf0e 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align  4
 
 aes_encrypt_block4x:
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
 
-   enc_prepare w3, x2, x5
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+.Lecbencrestart:
+   enc_prepare w22, x21, x5
 
 .LecbencloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbenc1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl  aes_encrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbencrestart
b   .LecbencloopNx
 .Lecbenc1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbencout
 .Lecbencloop:
-   ld1 {v0.16b}, [x1], #16 /* get next pt block */
-   encrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   ld1 {v0.16b}, [x20], #16/* get next pt block */
+   encrypt_block   v0, w22, x21, x5, w6
+   st1 {v0.16b}, [x19], #16
+   subsw23, w23, #1
bne .Lecbencloop
 .Lecbencout:
-   ldp x29, x30, [sp], #16
+   frame_pop
ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
-   dec_prepare w3, x2, x5
+.Lecbdecrestart:
+   dec_prepare w22, x21, x5
 
 .LecbdecloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbdec1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl  aes_decrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbdecrestart
b   .LecbdecloopNx
 .Lecbdec1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbdecout
 .Lecbdecloop:
-   ld1 {v0.16b}, [x1], #16 /* get next ct block */
-   decrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, 

[PATCH v5 12/23] crypto: arm64/sha1-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha1-ce-core.S | 42 ++--
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 46049850727d..78eb35fb5056 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -69,30 +69,36 @@
 *int blocks)
 */
 ENTRY(sha1_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   loadrc  k0.4s, 0x5a827999, w6
+0: loadrc  k0.4s, 0x5a827999, w6
loadrc  k1.4s, 0x6ed9eba1, w6
loadrc  k2.4s, 0x8f1bbcdc, w6
loadrc  k3.4s, 0xca62c1d6, w6
 
/* load state */
-   ld1 {dgav.4s}, [x0]
-   ldr dgb, [x0, #16]
+   ld1 {dgav.4s}, [x19]
+   ldr dgb, [x19, #16]
 
/* load sha1_ce_state::finalize */
ldr_l   w4, sha1_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v8.16b, v8.16b  )
 CPU_LE(rev32   v9.16b, v9.16b  )
 CPU_LE(rev32   v10.16b, v10.16b)
 CPU_LE(rev32   v11.16b, v11.16b)
 
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
 
add_update  c, ev, k0,  8,  9, 10, 11, dgb
@@ -123,16 +129,25 @@ CPU_LE(   rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha1_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv9.2d, #0
mov x8, #0x8000
moviv10.2d, #0
@@ -141,10 +156,11 @@ CPU_LE(   rev32   v11.16b, v11.16b)
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s}, [x0]
-   str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   frame_pop
ret
 ENDPROC(sha1_ce_transform)
-- 
2.15.1



[PATCH v5 10/23] arm64: assembler: add utility macros to push/pop stack frames

2018-03-10 Thread Ard Biesheuvel
We are going to add code to all the NEON crypto routines that will
turn them into non-leaf functions, so we need to manage the stack
frames. To make this less tedious and error prone, add some macros
that take the number of callee saved registers to preserve and the
extra size to allocate in the stack frame (for locals) and emit
the ldp/stp sequences.
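
As a quick sketch of the sizing these macros perform: the offset published
as .Lframe_local_offset follows directly from the register count. The helper
below merely mirrors the arithmetic of the __frame macro in the hunk; it is
not kernel code.

#include <stdio.h>

/* x29/x30 plus the requested callee-saved registers, rounded up to whole
 * 16-byte stp slots; any @extra bytes of locals start at this offset. */
static unsigned int frame_local_offset(unsigned int regcount)
{
        return ((regcount + 3) / 2) * 16;
}

int main(void)
{
        /* frame_push 4, 64 (as in the CRC32 patch): locals live at sp + 48 */
        printf("%u\n", frame_local_offset(4));
        /* frame_push 7 (as in the CCM patch): locals would start at sp + 80 */
        printf("%u\n", frame_local_offset(7));
        return 0;
}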

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/include/asm/assembler.h | 70 
 1 file changed, 70 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 053d83e8db6f..eef1fd2c1c0b 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -555,6 +555,19 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
 #endif
.endm
 
+/*
+ * Errata workaround post TTBR0_EL1 update.
+ */
+   .macro  post_ttbr0_update_workaround
+#ifdef CONFIG_CAVIUM_ERRATUM_27456
+alternative_if ARM64_WORKAROUND_CAVIUM_27456
+   ic  iallu
+   dsb nsh
+   isb
+alternative_else_nop_endif
+#endif
+   .endm
+
 /**
  * Errata workaround prior to disable MMU. Insert an ISB immediately prior
  * to executing the MSR that will change SCTLR_ELn[M] from a value of 1 to 0.
@@ -565,4 +578,61 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
 #endif
.endm
 
+   /*
+* frame_push - Push @regcount callee saved registers to the stack,
+*  starting at x19, as well as x29/x30, and set x29 to
+*  the new value of sp. Add @extra bytes of stack space
+*  for locals.
+*/
+   .macro  frame_push, regcount:req, extra
+   __frame st, \regcount, \extra
+   .endm
+
+   /*
+* frame_pop  - Pop the callee saved registers from the stack that were
+*  pushed in the most recent call to frame_push, as well
+*  as x29/x30 and any extra stack space that may have been
+*  allocated.
+*/
+   .macro  frame_pop
+   __frame ld
+   .endm
+
+   .macro  __frame_regs, reg1, reg2, op, num
+   .if .Lframe_regcount == \num
+   \op\()r \reg1, [sp, #(\num + 1) * 8]
+   .elseif .Lframe_regcount > \num
+   \op\()p \reg1, \reg2, [sp, #(\num + 1) * 8]
+   .endif
+   .endm
+
+   .macro  __frame, op, regcount, extra=0
+   .ifc\op, st
+   .if (\regcount) < 0 || (\regcount) > 10
+   .error  "regcount should be in the range [0 ... 10]"
+   .endif
+   .if ((\extra) % 16) != 0
+   .error  "extra should be a multiple of 16 bytes"
+   .endif
+   .set.Lframe_regcount, \regcount
+   .set.Lframe_extra, \extra
+   .set.Lframe_local_offset, ((\regcount + 3) / 2) * 16
+   stp x29, x30, [sp, #-.Lframe_local_offset - .Lframe_extra]!
+   mov x29, sp
+   .elseif .Lframe_regcount == -1 // && op == 'ld'
+   .error  "frame_push/frame_pop may not be nested"
+   .endif
+
+   __frame_regsx19, x20, \op, 1
+   __frame_regsx21, x22, \op, 3
+   __frame_regsx23, x24, \op, 5
+   __frame_regsx25, x26, \op, 7
+   __frame_regsx27, x28, \op, 9
+
+   .ifc\op, ld
+   ldp x29, x30, [sp], #.Lframe_local_offset + .Lframe_extra
+   .set.Lframe_regcount, -1
+   .endif
+   .endm
+
 #endif /* __ASM_ASSEMBLER_H */
-- 
2.15.1



[PATCH v5 21/23] crypto: arm64/sha512-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha512-ce-core.S | 27 +++-
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/sha512-ce-core.S 
b/arch/arm64/crypto/sha512-ce-core.S
index 7f3bca5c59a2..ce65e3abe4f2 100644
--- a/arch/arm64/crypto/sha512-ce-core.S
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -107,17 +107,23 @@
 */
.text
 ENTRY(sha512_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load state */
-   ld1 {v8.2d-v11.2d}, [x0]
+0: ld1 {v8.2d-v11.2d}, [x19]
 
/* load first 4 round constants */
adr_l   x3, .Lsha512_rcon
ld1 {v20.2d-v23.2d}, [x3], #64
 
/* load input */
-0: ld1 {v12.2d-v15.2d}, [x1], #64
-   ld1 {v16.2d-v19.2d}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v12.2d-v15.2d}, [x20], #64
+   ld1 {v16.2d-v19.2d}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev64   v12.16b, v12.16b)
 CPU_LE(rev64   v13.16b, v13.16b)
@@ -196,9 +202,18 @@ CPU_LE(rev64   v19.16b, v19.16b)
add v11.2d, v11.2d, v3.2d
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {v8.2d-v11.2d}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/* store new state */
-3: st1 {v8.2d-v11.2d}, [x0]
+3: st1 {v8.2d-v11.2d}, [x19]
+   frame_pop
ret
 ENDPROC(sha512_ce_transform)
-- 
2.15.1



[PATCH v5 19/23] crypto: arm64/crct10dif-ce - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crct10dif-ce-core.S | 32 +---
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S 
b/arch/arm64/crypto/crct10dif-ce-core.S
index f179c01bd55c..663ea71cdb38 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpugeneric+crypto
 
-   arg1_low32  .reqw0
-   arg2.reqx1
-   arg3.reqx2
+   arg1_low32  .reqw19
+   arg2.reqx20
+   arg3.reqx21
 
vzr .reqv13
 
 ENTRY(crc_t10dif_pmull)
+   frame_push  3, 128
+
+   mov arg1_low32, w0
+   mov arg2, x1
+   mov arg3, x2
+
movivzr.16b, #0 // init zero register
 
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE(ext v12.16b, v12.16b, v12.16b, #8   
)
subsarg3, arg3, #128
 
// check if there is another 64B in the buffer to be able to fold
-   b.ge_fold_64_B_loop
+   b.lt_fold_64_B_end
+
+   if_will_cond_yield_neon
+   stp q0, q1, [sp, #.Lframe_local_offset]
+   stp q2, q3, [sp, #.Lframe_local_offset + 32]
+   stp q4, q5, [sp, #.Lframe_local_offset + 64]
+   stp q6, q7, [sp, #.Lframe_local_offset + 96]
+   do_cond_yield_neon
+   ldp q0, q1, [sp, #.Lframe_local_offset]
+   ldp q2, q3, [sp, #.Lframe_local_offset + 32]
+   ldp q4, q5, [sp, #.Lframe_local_offset + 64]
+   ldp q6, q7, [sp, #.Lframe_local_offset + 96]
+   ldr_l   q10, rk3, x8
+   movivzr.16b, #0 // init zero register
+   endif_yield_neon
+
+   b   _fold_64_B_loop
 
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
 _cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+   frame_pop
ret
 
 _less_than_128:
-- 
2.15.1



[PATCH v5 17/23] crypto: arm64/aes-ghash - yield NEON after every block of input

2018-03-10 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/ghash-ce-core.S | 113 ++--
 arch/arm64/crypto/ghash-ce-glue.c |  28 +++--
 2 files changed, 97 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S 
b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..dcffb9e77589 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
 
.macro  __pmull_ghash, pn
-   ld1 {SHASH.2d}, [x3]
-   ld1 {XL.2d}, [x1]
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+   ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
/* do the head block first, if supplied */
-   cbz x4, 0f
-   ld1 {T1.2d}, [x4]
-   b   1f
+   cbz x23, 1f
+   ld1 {T1.2d}, [x23]
+   mov x23, xzr
+   b   2f
 
-0: ld1 {T1.2d}, [x2], #16
-   sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+   sub w19, w19, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
 CPU_LE(rev64   T1.16b, T1.16b  )
 
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE(rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
-   cbnzw0, 0b
+   cbz w19, 3f
+
+   if_will_cond_yield_neon
+   st1 {XL.2d}, [x20]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
-   st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+   frame_pop
ret
.endm
 
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
 
.macro  pmull_gcm_do_crypt, enc
-   ld1 {SHASH.2d}, [x4]
-   ld1 {XL.2d}, [x1]
-   ldr x8, [x5, #8]// load lower counter
+   frame_push  10
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+   mov x26, x7
+   .if \enc == 1
+   ldr x27, [sp, #96]  // first stacked arg
+   .endif
+
+   ldr x28, [x24, #8]  // load lower counter
+CPU_LE(rev x28, x28)
+
+0: mov x0, x25
+   load_round_keys w26, x0
+   ld1 {SHASH.2d}, [x23]
+   ld1 {XL.2d}, [x20]
 
moviMASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE(rev x8, x8  )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
-   ld1 {KS.16b}, [x7]
+   ld1 {KS.16b}, [x27]
.endif
 
-0: ld1 {CTR.8b}, [x5]  // load upper counter
-   ld1 {INP.16b}, [x3], #16
-   rev x9, x8
-   add x8, x8, #1
-   sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+   ld1 {INP.16b}, [x22], #16
+   rev x9, x28
+   add x28, x28, #1
+   sub w19, w19, #1
ins CTR.d[1], x9// set lower counter
 
.if \enc == 1
eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x2], #16
+   st1 {INP.16b}, [x21], #16
.endif
 
rev64   T1.16b, INP.16b
 
-   cmp w6, #12
-   b.ge2f  // AES-192/256?
+   cmp w26, #12
+   b.ge4f  // AES-192/256?
 
-1: enc_round   CTR, v21
+2: enc_round   CTR, v21
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE(   rev x8, x8  )
 
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
-

[PATCH v5 01/23] crypto: testmgr - add a new test case for CRC-T10DIF

2018-03-10 Thread Ard Biesheuvel
In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to take multiple iterations
(and thus possible preemption between them) of the primary loop of the
accelerated x86 and arm64 implementations.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/testmgr.h | 259 
 1 file changed, 259 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6044f6906bd6..52d9ff93beac 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -2044,6 +2044,265 @@ static const struct hash_testvec 
crct10dif_tv_template[] = {
.digest = (u8 *)(u16 []){ 0x44c6 },
.np = 4,
.tap= { 1, 255, 57, 6 },
+   }, {
+   .plaintext ="\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+   "\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+   "\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+   "\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+   "\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+   "\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+   "\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+   "\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+   "\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+   "\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+   "\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+   "\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+   "\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+   "\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+   "\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+   "\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+   "\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+   "\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+   "\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+   "\x58\xef\x63\xfa\x91\x05\x9c\x33"
+   "\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+   "\x19\xb0\x47\xde\x52\xe9\x80\x17"
+   "\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+   "\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+   "\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+   "\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+   "\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+   "\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+   "\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+   "\x86\x1d\x91\x28\xbf\x33\xca\x61"
+   "\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+   "\x47\xde\x75\x0c\x80\x17\xae\x22"
+   "\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+   "\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+   "\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+   "\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+   "\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+   "\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+   "\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+   "\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+   "\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+   "\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+   "\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+   "\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+   "\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+   "\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+   "\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+   "\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+   "\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+   "\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+   "\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+   "\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+   "\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+   "\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+   "\xf9\x6d\x04\

[PATCH v5 02/23] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
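
Schematically, the resulting calling pattern in the glue code looks like the
sketch below; the types and helpers here are minimal stand-ins for this
illustration only, while the hunks that follow use the real skcipher walk
API.

#include <stddef.h>

/* Minimal stand-ins -- not the kernel API. */
struct walk_stub { size_t nbytes; };
static void kernel_neon_begin_stub(void) { }
static void kernel_neon_end_stub(void) { }
static void neon_crypto_chunk(struct walk_stub *w) { (void)w; }
static int walk_done(struct walk_stub *w) { w->nbytes = 0; return 0; }

/* NEON (and hence preemption-off) now brackets only the asm call; the walk
 * bookkeeping, which may allocate or free memory, runs preemptibly. */
static int crypt_walk(struct walk_stub *walk)
{
        int err = 0;

        while (walk->nbytes) {
                kernel_neon_begin_stub();
                neon_crypto_chunk(walk);        /* the actual NEON asm */
                kernel_neon_end_stub();
                err = walk_done(walk);          /* may sleep/allocate now */
        }
        return err;
}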

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++--
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c 
b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 
maciv[], u32 msglen)
 }
 
 static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
-  u32 abytes, u32 *macp, bool use_neon)
+  u32 abytes, u32 *macp)
 {
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
+   kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
 num_rounds(key));
+   kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 
mac[], u8 const in[],
}
 }
 
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
-  bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 {
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
ltag.len = 6;
}
 
-   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
 
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
-   ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+   ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
 
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
-   bool use_neon = may_use_simd();
int err;
 
err = ccm_init_mac(req, mac, len);
if (err)
return err;
 
-   if (likely(use_neon))
-   kernel_neon_begin();
-
if (req->assoclen)
-   ccm_calculate_auth_mac(req, mac, use_neon);
+   ccm_calculate_auth_mac(req, mac);
 
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
err = skcipher_walk_aead_encrypt(&walk, req, true);
 
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
 
if (walk.nbytes == walk.total)
tail = 0;
 
+   kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
   walk.src.virt.addr,
   walk.nbytes - tail, ctx->key_enc,
   num_rounds(ctx), mac, walk.iv);
+   kernel_neon_end();
 
err = skcipher_walk_done(&walk, tail);
}
-   if (!err)
+   if (!err) {
+   kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
 num_rounds(ctx));
-
-   kernel_neon_end();
+   kernel_neon_end();
+

[PATCH v5 04/23] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-glue.c | 36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-glue.c 
b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
   ctx->rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
/* fall back to the non-bitsliced NEON implementation */
+   kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 ctx->enc, ctx->key.rounds, blocks,
 walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->key.rk, ctx->key.rounds, blocks,
  walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 buf[AES_BLOCK_SIZE];
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
 
+   kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->rk, ctx->rounds, blocks, walk.i

[PATCH v5 03/23] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-glue.c| 95 ++--
 arch/arm64/crypto/aes-modes.S   | 90 +--
 arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
 3 files changed, 97 insertions(+), 102 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 2fa850e86aa8..253188fb8cb0 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
 
 /* defined in aes-modes.S */
 asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 
 asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 ctr[], int first);
+   int rounds, int blocks, u8 ctr[]);
 
 asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks, first);
+   (u8 *)ctx->key_enc, rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_dec, rounds, blocks, first);
+   (u8 *)ctx->key_dec, rounds, blocks);
+   kernel_neon_end();
  

[PATCH v5 07/23] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path

2018-03-10 Thread Ard Biesheuvel
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-modes.S | 31 
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
 */
 
 AES_ENTRY(aes_cbc_encrypt)
-   ld1 {v0.16b}, [x5]  /* get iv */
+   ld1 {v4.16b}, [x5]  /* get iv */
enc_prepare w3, x2, x6
 
-.Lcbcencloop:
-   ld1 {v1.16b}, [x1], #16 /* get next pt block */
-   eor v0.16b, v0.16b, v1.16b  /* ..and xor with iv */
+.Lcbcencloop4x:
+   subs    w4, w4, #4
+   bmi .Lcbcenc1x
+   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   eor v0.16b, v0.16b, v4.16b  /* ..and xor with iv */
encrypt_block   v0, w3, x2, x6, w7
-   st1 {v0.16b}, [x0], #16
+   eor v1.16b, v1.16b, v0.16b
+   encrypt_block   v1, w3, x2, x6, w7
+   eor v2.16b, v2.16b, v1.16b
+   encrypt_block   v2, w3, x2, x6, w7
+   eor v3.16b, v3.16b, v2.16b
+   encrypt_block   v3, w3, x2, x6, w7
+   st1 {v0.16b-v3.16b}, [x0], #64
+   mov v4.16b, v3.16b
+   b   .Lcbcencloop4x
+.Lcbcenc1x:
+   adds    w4, w4, #4
+   beq .Lcbcencout
+.Lcbcencloop:
+   ld1 {v0.16b}, [x1], #16 /* get next pt block */
+   eor v4.16b, v4.16b, v0.16b  /* ..and xor with iv */
+   encrypt_block   v4, w3, x2, x6, w7
+   st1 {v4.16b}, [x0], #16
subs    w4, w4, #1
bne .Lcbcencloop
-   st1 {v0.16b}, [x5]  /* return iv */
+.Lcbcencout:
+   st1 {v4.16b}, [x5]  /* return iv */
ret
 AES_ENDPROC(aes_cbc_encrypt)
 
-- 
2.15.1
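
The chaining that the 4-way loop above implements, written out (conceptual
only, not driver code): the four AES invocations have to stay serial because
each ciphertext block feeds the next block's XOR, but the four plaintext
loads and four ciphertext stores collapse into a single ld1/st1 pair per
iteration:

    c0  = E_K(p0 xor iv)
    c1  = E_K(p1 xor c0)
    c2  = E_K(p2 xor c1)
    c3  = E_K(p3 xor c2)
    iv' = c3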



[PATCH v5 06/23] crypto: arm64/aes-blk - remove configurable interleave

2018-03-10 Thread Ard Biesheuvel
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build time configurable (as well as an option to instantiate
all interleaved sequences inline rather than as subroutines)

We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Makefile|   3 -
 arch/arm64/crypto/aes-modes.S | 237 
 2 files changed, 40 insertions(+), 200 deletions(-)

diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index cee9b8d9830b..b6b624602582 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -59,9 +59,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
 aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
 
-AFLAGS_aes-ce.o:= -DINTERLEAVE=4
-AFLAGS_aes-neon.o  := -DINTERLEAVE=4
-
 CFLAGS_aes-glue-ce.o   := -DUSE_V8_CRYPTO_EXTENSIONS
 
 $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
.text
.align  4
 
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare   - setup NEON registers for encryption
- * - dec_prepare   - setup NEON registers for decryption
- * - enc_switch_key- change to new key after having prepared for encryption
- * - encrypt_block - encrypt a single block
- * - decrypt block - decrypt a single block
- * - encrypt_block2x   - encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x   - decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x   - encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x   - decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP  ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
 aes_encrypt_block4x:
encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
ret
 ENDPROC(aes_decrypt_block4x)
 
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
-   .macro  do_encrypt_block2x
-   bl  aes_encrypt_block2x
-   .endm
-
-   .macro  do_decrypt_block2x
-   bl  aes_decrypt_block2x
-   .endm
-
-   .macro  do_encrypt_block4x
-   bl  aes_encrypt_block4x
-   .endm
-
-   .macro  do_decrypt_block4x
-   bl  aes_decrypt_block4x
-   .endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
-   .macro  do_encrypt_block2x
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block2x
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_encrypt_block4x
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block4x
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-#endif
-
/*
 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 * int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   FRAME_PUSH
+   stp x29, x30, [sp, #-16]!
+   mov x29, sp
 
enc_prepare w3, x2, x5
 
 .LecbencloopNx:
-#if INTERLEAVE >= 2
-   subs    w4, w4, #INTERLEAVE
+   subs    w4, w4, #4
bmi .Lecbenc1x
-#if INTERLEAVE == 2
-   ld1 {v0.16b-v1.16b}, [x1], #32  /* get 2 pt blocks */
-   do_encrypt_block2x
-   st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
-   do_encrypt_block4x
+   bl  aes_encrypt_bl

[PATCH v5 05/23] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop

2018-03-10 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/chacha20-neon-glue.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c 
b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
u8 buf[CHACHA20_BLOCK_SIZE];
 
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+   kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
+   kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
+
+   if (!bytes)
+   return;
+
+   kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+   kernel_neon_end();
 }
 
 static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
crypto_chacha20_init(state, ctx, walk.iv);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
 
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
-   kernel_neon_end();
 
return err;
 }
-- 
2.15.1



[PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT

2018-03-10 Thread Ard Biesheuvel
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.

This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing resp.
loading the contents of all NEON registers to/from memory, and so doing
it less often had an obvious performance benefit. However, in the mean time,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is not a big deal anymore (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls)

So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.

As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
flag each time after processing a fixed chunk of input.

Given that this issue was flagged by the RT people, I would appreciate it
if they could confirm whether they are happy with this approach.

Changes since v4:
- rebase onto v4.16-rc3
- apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
  in v4.16-rc1

Changes since v3:
- incorporate Dave's feedback on the asm macros to push/pop frames and to yield
  the NEON conditionally
- make frame_push/pop more easy to use, by recording the arguments to
  frame_push, removing the need to specify them again when calling frame_pop
- emit local symbol .Lframe_local_offset to allow code using the frame push/pop
  macros to index the stack more easily
- use the magic \@ macro invocation counter provided by GAS to generate unique
  labels in the NEON yield macros, rather than relying on chance

Changes since v2:
- Drop logic to yield only after so many blocks - as it turns out, the
  throughput of the algorithms that are most likely to be affected by the
  overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if that
  is unacceptable, you are probably not using CONFIG_PREEMPT in the first
  place.
- Add yield support to the AES-CCM driver
- Clean up macros based on feedback from Dave
- Given that I had to add stack frame logic to many of these functions, factor
  it out and wrap it in a couple of macros
- Merge the changes to the core asm driver and glue code of the GHASH/GCM
  driver. The latter was not correct without the former.

Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
  with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE

Cc: Dave Martin <dave.mar...@arm.com>
Cc: Russell King - ARM Linux <li...@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bige...@linutronix.de>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: linux-rt-us...@vger.kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Catalin Marinas <catalin.mari...@arm.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Thomas Gleixner <t...@linutronix.de>

Ard Biesheuvel (23):
  crypto: testmgr - add a new test case for CRC-T10DIF
  crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - remove configurable interleave
  crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
  crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
  crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
  arm64: assembler: add utility macros to push/pop stack frames
  arm64: assembler: add macros to conditionally yield the NEON under
PREEMPT
  crypto: arm64/sha1-ce - yield NEON after every block of input
  crypto: arm64/sha2-ce - yield NEON after every block of input
  crypto: arm64/aes-ccm - yield NEON after every block of input
  crypto: arm64/aes-blk - yield NEON after every block of input
  crypto: arm64/aes-bs - yield NEON after every block of input
  crypto: a
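
For illustration, the "yield NEON after every block of input" patches listed
above amount to the following, reduced to C; the series does the equivalent
in assembly, and only kernel_neon_begin()/kernel_neon_end() and
need_resched() below are real kernel interfaces, while CHUNK, BLOCK_SIZE and
process_blocks_neon() are stand-ins:

	kernel_neon_begin();
	while (blocks > 0) {
		int n = min(blocks, CHUNK);		/* fixed amount of work */

		process_blocks_neon(state, src, n);	/* stand-in for the asm body */
		src += n * BLOCK_SIZE;
		blocks -= n;

		if (blocks > 0 && need_resched()) {
			kernel_neon_end();		/* re-enables preemption, */
			kernel_neon_begin();		/* so we may reschedule here */
		}
	}
	kernel_neon_end();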

Re: [RFC PATCH] crypto: arm64/speck - add NEON-accelerated implementation of Speck-XTS

2018-03-06 Thread Ard Biesheuvel
On 6 March 2018 at 12:35, Dave Martin  wrote:
> On Mon, Mar 05, 2018 at 11:17:07AM -0800, Eric Biggers wrote:
>> Add a NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> for ARM64.  This is ported from the 32-bit version.  It may be useful on
>> devices with 64-bit ARM CPUs that don't have the Cryptography
>> Extensions, so cannot do AES efficiently -- e.g. the Cortex-A53
>> processor on the Raspberry Pi 3.
>>
>> It generally works the same way as the 32-bit version, but there are
>> some slight differences due to the different instructions, registers,
>> and syntax available in ARM64 vs. in ARM32.  For example, in the 64-bit
>> version there are enough registers to hold the XTS tweaks for each
>> 128-byte chunk, so they don't need to be saved on the stack.
>>
>> Benchmarks on a Raspberry Pi 3 running a 64-bit kernel:
>>
>>Algorithm  Encryption Decryption
>>-  -- --
>>Speck64/128-XTS (NEON) 92.2 MB/s  92.2 MB/s
>>Speck128/256-XTS (NEON)75.0 MB/s  75.0 MB/s
>>Speck128/256-XTS (generic) 47.4 MB/s  35.6 MB/s
>>AES-128-XTS (NEON bit-sliced)  33.4 MB/s  29.6 MB/s
>>AES-256-XTS (NEON bit-sliced)  24.6 MB/s  21.7 MB/s
>>
>> The code performs well on higher-end ARM64 processors as well, though
>> such processors tend to have the Crypto Extensions which make AES
>> preferred.  For example, here are the same benchmarks run on a HiKey960
>> (with CPU affinity set for the A73 cores), with the Crypto Extensions
>> implementation of AES-256-XTS added:
>>
>>Algorithm  Encryption Decryption
>>-  ------
>>AES-256-XTS (Crypto Extensions)1273.3 MB/s1274.7 MB/s
>>Speck64/128-XTS (NEON)  359.8 MB/s 348.0 MB/s
>>Speck128/256-XTS (NEON) 292.5 MB/s 286.1 MB/s
>>Speck128/256-XTS (generic)  186.3 MB/s 181.8 MB/s
>>AES-128-XTS (NEON bit-sliced)   142.0 MB/s 124.3 MB/s
>>AES-256-XTS (NEON bit-sliced)   104.7 MB/s  91.1 MB/s
>>
>> Signed-off-by: Eric Biggers 
>> ---
>>  arch/arm64/crypto/Kconfig   |   6 +
>>  arch/arm64/crypto/Makefile  |   3 +
>>  arch/arm64/crypto/speck-neon-core.S | 352 
>>  arch/arm64/crypto/speck-neon-glue.c | 282 ++
>>  4 files changed, 643 insertions(+)
>>  create mode 100644 arch/arm64/crypto/speck-neon-core.S
>>  create mode 100644 arch/arm64/crypto/speck-neon-glue.c
>>
>> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
>> index 285c36c7b408..cb5a243110c4 100644
>> --- a/arch/arm64/crypto/Kconfig
>> +++ b/arch/arm64/crypto/Kconfig
>> @@ -113,4 +113,10 @@ config CRYPTO_AES_ARM64_BS
>>   select CRYPTO_AES_ARM64
>>   select CRYPTO_SIMD
>>
>> +config CRYPTO_SPECK_NEON
>> + tristate "NEON accelerated Speck cipher algorithms"
>> + depends on KERNEL_MODE_NEON
>> + select CRYPTO_BLKCIPHER
>> + select CRYPTO_SPECK
>> +
>>  endif
>> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
>> index cee9b8d9830b..d94ebd15a859 100644
>> --- a/arch/arm64/crypto/Makefile
>> +++ b/arch/arm64/crypto/Makefile
>> @@ -53,6 +53,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>
>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>> +
>>  obj-$(CONFIG_CRYPTO_AES_ARM64) += aes-arm64.o
>>  aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
>>
>> diff --git a/arch/arm64/crypto/speck-neon-core.S 
>> b/arch/arm64/crypto/speck-neon-core.S
>> new file mode 100644
>> index ..b14463438b09
>> --- /dev/null
>> +++ b/arch/arm64/crypto/speck-neon-core.S
>> @@ -0,0 +1,352 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * ARM64 NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> + *
>> + * Copyright (c) 2018 Google, Inc
>> + *
>> + * Author: Eric Biggers 
>> + */
>> +
>> +#include 
>> +
>> + .text
>> +
>> + // arguments
>> + ROUND_KEYS  .reqx0  // const {u64,u32} *round_keys
>> + NROUNDS .reqw1  // int nrounds
>> + NROUNDS_X   .reqx1
>> + DST .reqx2  // void *dst
>> + SRC .reqx3  // const void *src
>> + NBYTES  .reqw4  // unsigned int nbytes
>> + TWEAK   .reqx5  // void *tweak
>> +
>> + // registers which hold the data being encrypted/decrypted
>> + // (underscores avoid a naming collision with ARM64 registers x0-x3)
>> + X_0 .reqv0
>> + Y_0 .reqv1
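
For anyone skimming the XTS parts of this thread: per cipher block j of a
sector, XTS composes the block cipher E as

    T_0     = E_K2(sector tweak)
    C_j     = E_K1(P_j xor T_j) xor T_j
    T_{j+1} = T_j * x        (multiplication in GF(2^128), or GF(2^64) in
                              the case of Speck64-XTS)

so most of what the NEON code batches per 128-byte chunk is the cheap tweak
updates and XORs around the cipher rounds themselves.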

Re: [PATCH v2 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-02-13 Thread Ard Biesheuvel
On 13 February 2018 at 18:57, Eric Biggers <ebigg...@google.com> wrote:
> Hi Ard,
>
> On Tue, Feb 13, 2018 at 11:34:36AM +, Ard Biesheuvel wrote:
>> Hi Eric,
>>
>> On 12 February 2018 at 23:52, Eric Biggers <ebigg...@google.com> wrote:
>> > Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> > 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> > Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> > encrypted/decrypted (doing one cipher round for all the blocks, then the
>> > next round, etc.), then goes through XTS postprocessing.
>> >
>> > The performance depends on the processor but can be about 3 times faster
>> > than the generic code.  For example, on an ARMv7 processor we observe
>> > the following performance with Speck128/256-XTS:
>> >
>> > xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
>> > xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>> >
>> > In comparison to AES-256-XTS without the Cryptography Extensions:
>> >
>> > xts-aes-neonbs:Encryption  41.2 MB/s, Decryption  36.7 MB/s
>> > xts(aes-asm):  Encryption  31.7 MB/s, Decryption  30.8 MB/s
>> > xts(aes-generic):  Encryption  21.2 MB/s, Decryption  20.9 MB/s
>> >
>> > Speck64/128-XTS is even faster:
>> >
>> > xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
>> >
>> > Note that as with the generic code, only the Speck128 and Speck64
>> > variants are supported.  Also, for now only the XTS mode of operation is
>> > supported, to target the disk and file encryption use cases.  The NEON
>> > code also only handles the portion of the data that is evenly divisible
>> > into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> > course, other modes of operation could be added later if needed, and/or
>> > the NEON code could be updated to handle other buffer sizes.
>> >
>> > The XTS specification is only defined for AES which has a 128-bit block
>> > size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> > reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> > paper.  Of course, when possible users should use Speck128-XTS, but even
>> > that may be too slow on some processors; Speck64-XTS can be faster.
>> >
>>
>> I think this is excellent work. Speck seems an appropriate solution to
>> this problem, and I'm glad we are not ending up with a stream cipher
>> for block encryption.
>>
>> Also, I think an arm64 port would be nice. I may take a stab at this
>> if nobody else beats me to it.
>
> We don't really want to encourage people to use Speck over AES with the
> Cryptography Extensions, so that's why I didn't include an arm64 port.  That
> being said, I suppose we can't stop people from adding an arm64 port if they
> really do prefer Speck, or maybe for use on arm64 CPUs that don't have the
> Cryptography Extensions (though I thought that almost all do).
>

Many do, but not all of them. A notable exception is the Raspberry Pi 3.

>>
>> I did run into an issue with this code though: On big-endian, I get
>>
>> [ 0.272381] alg: skcipher: Test 1 failed (invalid result) on
>> encryption for xts-speck64-neon
>> [0.276151] : 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8
>> [0.278541] 0010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b
>>
>> so there may be a byte order corner case you missed in the rewrite (or
>> the issue existed before, as I did not test your v1)
>>
>
> To be honest I haven't tested either version on a big endian ARM CPU yet.  I
> don't really know how to do that currently; maybe it's possible with QEMU.
>

I tested this on a big-endian 32-bit VM running under KVM on a 64-bit host.

> But assuming I haven't missed anything, in the assembly code everything is
> treated as byte arrays with the exception of the round keys which are 32-bit 
> or
> 64-bit numbers in CPU endianness.  The byte arrays are loaded and stored with
> vld1.8 and vst1.8 while the round keys are loaded with vld1.32 or vld1.64, so
> the assembly code *should* work correctly on a big endian CPU.
>

Indeed.

> However, looking over it now, I think there is a bug in the glue code for
> Speck64-XTS when it handles buffers not evenly divisible into 128 bytes.
> Namely, the tweak is treated as CPU endian when it should be little endian.
> Could you try the following patch?
>
> diff --git
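
The tweak update at issue here is a multiply-by-x in GF(2^64) with the
x^64 + x^4 + x^3 + x + 1 polynomial quoted above; a standalone sketch (not
the patch Eric is referring to), with the value in host order, whereas the
driver stores the tweak as little endian:

	#include <stdint.h>

	/* Multiply a GF(2^64) element by x, reducing with
	 * x^64 + x^4 + x^3 + x + 1 (low byte 0x1b). */
	static uint64_t gf2_64_mul_x(uint64_t t)
	{
		uint64_t carry = t >> 63;	/* coefficient that falls off the top */

		return (t << 1) ^ (carry * 0x1bULL);
	}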

Re: [PATCH v2 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-02-13 Thread Ard Biesheuvel
Hi Eric,

On 12 February 2018 at 23:52, Eric Biggers  wrote:
> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
> encrypted/decrypted (doing one cipher round for all the blocks, then the
> next round, etc.), then goes through XTS postprocessing.
>
> The performance depends on the processor but can be about 3 times faster
> than the generic code.  For example, on an ARMv7 processor we observe
> the following performance with Speck128/256-XTS:
>
> xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
> xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>
> In comparison to AES-256-XTS without the Cryptography Extensions:
>
> xts-aes-neonbs:Encryption  41.2 MB/s, Decryption  36.7 MB/s
> xts(aes-asm):  Encryption  31.7 MB/s, Decryption  30.8 MB/s
> xts(aes-generic):  Encryption  21.2 MB/s, Decryption  20.9 MB/s
>
> Speck64/128-XTS is even faster:
>
> xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
>
> Note that as with the generic code, only the Speck128 and Speck64
> variants are supported.  Also, for now only the XTS mode of operation is
> supported, to target the disk and file encryption use cases.  The NEON
> code also only handles the portion of the data that is evenly divisible
> into 128-byte chunks, with any remainder handled by a C fallback.  Of
> course, other modes of operation could be added later if needed, and/or
> the NEON code could be updated to handle other buffer sizes.
>
> The XTS specification is only defined for AES which has a 128-bit block
> size, so for the GF(2^64) math needed for Speck64-XTS we use the
> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
> paper.  Of course, when possible users should use Speck128-XTS, but even
> that may be too slow on some processors; Speck64-XTS can be faster.
>

I think this is excellent work. Speck seems an appropriate solution to
this problem, and I'm glad we are not ending up with a stream cipher
for block encryption.

Also, I think an arm64 port would be nice. I may take a stab at this
if nobody else beats me to it.

I did run into an issue with this code though: On big-endian, I get

[0.272381] alg: skcipher: Test 1 failed (invalid result) on
encryption for xts-speck64-neon
[0.276151] : 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8
[0.278541] 0010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b

so there may be a byte order corner case you missed in the rewrite (or
the issue existed before, as I did not test your v1)

-- 
Ard.


> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/Kconfig   |   6 +
>  arch/arm/crypto/Makefile  |   2 +
>  arch/arm/crypto/speck-neon-core.S | 432 
> ++
>  arch/arm/crypto/speck-neon-glue.c | 290 +
>  4 files changed, 730 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index b8e69fe282b8..925d1364727a 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
> select CRYPTO_BLKCIPHER
> select CRYPTO_CHACHA20
>
> +config CRYPTO_SPECK_NEON
> +   tristate "NEON accelerated Speck cipher algorithms"
> +   depends on KERNEL_MODE_NEON
> +   select CRYPTO_BLKCIPHER
> +   select CRYPTO_SPECK
> +
>  endif
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 30ef8e291271..a758107c5525 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>
>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
> @@ -53,6 +54,7 @@ ghash-arm-ce-y:= ghash-ce-core.o ghash-ce-glue.o
>  crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>
>  quiet_cmd_perl = PERL$@
>cmd_perl = $(PERL) $(<) > $(@)
> diff --git a/arch/arm/crypto/speck-neon-core.S 
> b/arch/arm/crypto/speck-neon-core.S
> new file mode 100644
> index ..3c1e203e53b9
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -0,0 +1,432 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of 

Re: [PATCH] crypto: arm/aes-cipher - move S-box to .rodata section

2018-02-13 Thread Ard Biesheuvel
On 12 February 2018 at 13:52, Jinbum Park <jinb.pa...@gmail.com> wrote:
> Move the AES inverse S-box to the .rodata section
> where it is safe from abuse by speculation.
>
> Signed-off-by: Jinbum Park <jinb.pa...@gmail.com>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  arch/arm/crypto/aes-cipher-core.S | 19 ++-
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm/crypto/aes-cipher-core.S 
> b/arch/arm/crypto/aes-cipher-core.S
> index 54b3840..184d6c2 100644
> --- a/arch/arm/crypto/aes-cipher-core.S
> +++ b/arch/arm/crypto/aes-cipher-core.S
> @@ -174,6 +174,16 @@
> .ltorg
> .endm
>
> +ENTRY(__aes_arm_encrypt)
> +   do_crypt    fround, crypto_ft_tab, crypto_ft_tab + 1, 2
> +ENDPROC(__aes_arm_encrypt)
> +
> +   .align  5
> +ENTRY(__aes_arm_decrypt)
> +   do_crypt    iround, crypto_it_tab, __aes_arm_inverse_sbox, 0
> +ENDPROC(__aes_arm_decrypt)
> +
> +   .section".rodata", "a"
> .align  L1_CACHE_SHIFT
> .type   __aes_arm_inverse_sbox, %object
>  __aes_arm_inverse_sbox:
> @@ -210,12 +220,3 @@ __aes_arm_inverse_sbox:
> .byte   0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26
> .byte   0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d
> .size   __aes_arm_inverse_sbox, . - __aes_arm_inverse_sbox
> -
> -ENTRY(__aes_arm_encrypt)
> -   do_crypt    fround, crypto_ft_tab, crypto_ft_tab + 1, 2
> -ENDPROC(__aes_arm_encrypt)
> -
> -   .align  5
> -ENTRY(__aes_arm_decrypt)
> -   do_crypt    iround, crypto_it_tab, __aes_arm_inverse_sbox, 0
> -ENDPROC(__aes_arm_decrypt)
> --
> 1.9.1
>


Re: [PATCH 1/3] compiler-gcc.h: Introduce __optimize function attribute

2018-02-01 Thread Ard Biesheuvel
On 1 February 2018 at 10:21, Geert Uytterhoeven <ge...@linux-m68k.org> wrote:
> Create a new function attribute __optimize, which allows to specify an
> optimization level on a per-function basis.
>
> Signed-off-by: Geert Uytterhoeven <ge...@linux-m68k.org>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
> I assume this is supported as of gcc-4.4:
>   - gcc version 4.3.3 (GCC): warning: ‘__optimize__’ attribute directive
> ignored
>   - gcc version 4.4.7 (Ubuntu/Linaro 4.4.7-1ubuntu2): OK
> ---
>  include/linux/compiler-gcc.h | 4 
>  include/linux/compiler.h | 4 
>  2 files changed, 8 insertions(+)
>
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index 631354acfa720475..0a278a527944ad2f 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -196,6 +196,10 @@
>  #endif /* __CHECKER__ */
>  #endif /* GCC_VERSION >= 40300 */
>
> +#if GCC_VERSION >= 40400
> +#define __optimize(level)  __attribute__((__optimize__(level)))
> +#endif /* GCC_VERSION >= 40400 */
> +
>  #if GCC_VERSION >= 40500
>
>  #ifndef __CHECKER__
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 52e611ab9a6cf6fd..5ff818e9a836e898 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -271,6 +271,10 @@ static __always_inline void __write_once_size(volatile 
> void *p, void *res, int s
>
>  #endif /* __ASSEMBLY__ */
>
> +#ifndef __optimize
> +# define __optimize(level)
> +#endif
> +
>  /* Compile time object size, -1 for unknown */
>  #ifndef __compiletime_object_size
>  # define __compiletime_object_size(obj) -1
> --
> 2.7.4
>


Re: [PATCH 3/3] crypto: sha3-generic - Use __optimize to support old compilers

2018-02-01 Thread Ard Biesheuvel
On 1 February 2018 at 10:22, Geert Uytterhoeven <ge...@linux-m68k.org> wrote:
> With gcc-4.1.2:
>
> crypto/sha3_generic.c:39: warning: ‘__optimize__’ attribute directive 
> ignored
>
> Use the newly introduced __optimize macro to fix this.
>
> Fixes: 83dee2ce1ae791c3 ("crypto: sha3-generic - rewrite KECCAK transform to 
> help the compiler optimize")
> Signed-off-by: Geert Uytterhoeven <ge...@linux-m68k.org>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  crypto/sha3_generic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
> index a965b9d8055983af..c409cd87fea5decd 100644
> --- a/crypto/sha3_generic.c
> +++ b/crypto/sha3_generic.c
> @@ -35,7 +35,7 @@ static const u64 keccakf_rndc[24] = {
>
>  /* update the state with given number of rounds */
>
> -static void __attribute__((__optimize__("O3"))) keccakf(u64 st[25])
> +static void __optimize("O3") keccakf(u64 st[25])
>  {
> u64 t[5], tt, bc[5];
> int round;
> --
> 2.7.4
>


[PATCH] crypto/generic - sha3: deal with oversize stack frames

2018-01-27 Thread Ard Biesheuvel
As reported by kbuild test robot, the optimized SHA3 C implementation
compiles to mn10300 code that uses a disproportionate amount of stack
space, i.e.,

  crypto/sha3_generic.c: In function 'keccakf':
  crypto/sha3_generic.c:147:1: warning: the frame size of 1232 bytes is larger 
than 1024 bytes [-Wframe-larger-than=]

As kindly diagnosed by Arnd, this does not only occur when building for
the mn10300 architecture (which is what the report was about) but also
for h8300, and builds for other 32-bit architectures show an increase in
stack space utilization as well.

Given that SHA3 operates on 64-bit quantities, and keeps a state matrix
of 25 64-bit words, it is not surprising that 32-bit architectures with
few general purpose registers are impacted the most by this, and it is
therefore reasonable to implement a workaround that distinguishes between
32-bit and 64-bit architectures.

Arnd figured out that taking the round calculation out of the loop, and
inlining it explicitly but only on 64-bit architectures preserves most
of the performance gain achieved by the rewrite, and also gets rid of
the excessive use of stack space.

Reported-by: kbuild test robot <fengguang...@intel.com>
Suggested-by: Arnd Bergmann <a...@arndb.de>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 218 +++-
 1 file changed, 118 insertions(+), 100 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index a965b9d80559..951c4eb70262 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -20,6 +20,20 @@
 #include 
 #include 
 
+/*
+ * On some 32-bit architectures (mn10300 and h8300), GCC ends up using
+ * over 1 KB of stack if we inline the round calculation into the loop
+ * in keccakf(). On the other hand, on 64-bit architectures with plenty
+ * of [64-bit wide] general purpose registers, not inlining it severely
+ * hurts performance. So let's use 64-bitness as a heuristic to decide
+ * whether to inline or not.
+ */
+#ifdef CONFIG_64BIT
+#define SHA3_INLINEinline
+#else
+#define SHA3_INLINEnoinline
+#endif
+
 #define KECCAK_ROUNDS 24
 
 static const u64 keccakf_rndc[24] = {
@@ -35,111 +49,115 @@ static const u64 keccakf_rndc[24] = {
 
 /* update the state with given number of rounds */
 
-static void __attribute__((__optimize__("O3"))) keccakf(u64 st[25])
+static SHA3_INLINE void keccakf_round(u64 st[25])
 {
u64 t[5], tt, bc[5];
-   int round;
 
-   for (round = 0; round < KECCAK_ROUNDS; round++) {
+   /* Theta */
+   bc[0] = st[0] ^ st[5] ^ st[10] ^ st[15] ^ st[20];
+   bc[1] = st[1] ^ st[6] ^ st[11] ^ st[16] ^ st[21];
+   bc[2] = st[2] ^ st[7] ^ st[12] ^ st[17] ^ st[22];
+   bc[3] = st[3] ^ st[8] ^ st[13] ^ st[18] ^ st[23];
+   bc[4] = st[4] ^ st[9] ^ st[14] ^ st[19] ^ st[24];
+
+   t[0] = bc[4] ^ rol64(bc[1], 1);
+   t[1] = bc[0] ^ rol64(bc[2], 1);
+   t[2] = bc[1] ^ rol64(bc[3], 1);
+   t[3] = bc[2] ^ rol64(bc[4], 1);
+   t[4] = bc[3] ^ rol64(bc[0], 1);
+
+   st[0] ^= t[0];
+
+   /* Rho Pi */
+   tt = st[1];
+   st[ 1] = rol64(st[ 6] ^ t[1], 44);
+   st[ 6] = rol64(st[ 9] ^ t[4], 20);
+   st[ 9] = rol64(st[22] ^ t[2], 61);
+   st[22] = rol64(st[14] ^ t[4], 39);
+   st[14] = rol64(st[20] ^ t[0], 18);
+   st[20] = rol64(st[ 2] ^ t[2], 62);
+   st[ 2] = rol64(st[12] ^ t[2], 43);
+   st[12] = rol64(st[13] ^ t[3], 25);
+   st[13] = rol64(st[19] ^ t[4],  8);
+   st[19] = rol64(st[23] ^ t[3], 56);
+   st[23] = rol64(st[15] ^ t[0], 41);
+   st[15] = rol64(st[ 4] ^ t[4], 27);
+   st[ 4] = rol64(st[24] ^ t[4], 14);
+   st[24] = rol64(st[21] ^ t[1],  2);
+   st[21] = rol64(st[ 8] ^ t[3], 55);
+   st[ 8] = rol64(st[16] ^ t[1], 45);
+   st[16] = rol64(st[ 5] ^ t[0], 36);
+   st[ 5] = rol64(st[ 3] ^ t[3], 28);
+   st[ 3] = rol64(st[18] ^ t[3], 21);
+   st[18] = rol64(st[17] ^ t[2], 15);
+   st[17] = rol64(st[11] ^ t[1], 10);
+   st[11] = rol64(st[ 7] ^ t[2],  6);
+   st[ 7] = rol64(st[10] ^ t[0],  3);
+   st[10] = rol64(tt ^ t[1],  1);
+
+   /* Chi */
+   bc[ 0] = ~st[ 1] & st[ 2];
+   bc[ 1] = ~st[ 2] & st[ 3];
+   bc[ 2] = ~st[ 3] & st[ 4];
+   bc[ 3] = ~st[ 4] & st[ 0];
+   bc[ 4] = ~st[ 0] & st[ 1];
+   st[ 0] ^= bc[ 0];
+   st[ 1] ^= bc[ 1];
+   st[ 2] ^= bc[ 2];
+   st[ 3] ^= bc[ 3];
+   st[ 4] ^= bc[ 4];
+
+   bc[ 0] = ~st[ 6] & st[ 7];
+   bc[ 1] = ~st[ 7] & st[ 8];
+   bc[ 2] = ~st[ 8] & st[ 9];
+   bc[ 3] = ~st[ 9] & st[ 5];
+   bc[ 4] = ~st[ 5] & st[ 6];
+   st[ 5] ^= bc[ 0];
+   st[ 6] ^= bc[ 1];
+   st[ 7] ^= bc[ 2];
+   st[ 8] ^= bc[ 3];
+   st[ 9] ^= bc[ 4];
+
+   bc[ 0] = ~st[11] & st[12];
+   bc[ 1] = ~st[12] & st[13];
+   bc[ 2] = ~st[13] & st[14];
+   bc[ 3] = ~st[14] & st[10

Re: [PATCH 0/8] crypto: arm64+generic - SHA3/SHA-512/SM-3 roundup

2018-01-22 Thread Ard Biesheuvel
On 22 January 2018 at 20:51, Arnd Bergmann <a...@arndb.de> wrote:
> On Mon, Jan 22, 2018 at 3:54 PM, Arnd Bergmann <a...@arndb.de> wrote:
>> On Fri, Jan 19, 2018 at 1:04 PM, Ard Biesheuvel
>> I'm doing a little more randconfig build testing here now, will write back by
>> the end of today in the unlikely case that if I find anything else wrong.
>
> Did a few hundred randconfig builds, everything fine as expected.
>

Thanks Arnd


[PATCH 6/8] crypto/arm64: sha3 - new v8.2 Crypto Extensions implementation

2018-01-19 Thread Ard Biesheuvel
Implement the various flavours of SHA3 using the new optional
EOR3/RAX1/XAR/BCAX instructions introduced by ARMv8.2.

Tested-by: Steve Capper <steve.cap...@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig|   6 +
 arch/arm64/crypto/Makefile   |   3 +
 arch/arm64/crypto/sha3-ce-core.S | 210 
 arch/arm64/crypto/sha3-ce-glue.c | 161 +++
 4 files changed, 380 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index aad288f4b9de..3321b2c9a2b5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -35,6 +35,12 @@ config CRYPTO_SHA512_ARM64_CE
select CRYPTO_HASH
select CRYPTO_SHA512_ARM64
 
+config CRYPTO_SHA3_ARM64
+   tristate "SHA3 digest algorithm (ARMv8.2 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_HASH
+   select CRYPTO_SHA3
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index b438b3dc9b4c..4ca2d146e213 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM64_CE) += sha512-ce.o
 sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o
+sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
new file mode 100644
index ..332ad7530690
--- /dev/null
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sha3-ce-core.S - core SHA-3 transform using v8.2 Crypto Extensions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+   .irp    b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+   .set    .Lv\b\().2d, \b
+   .set    .Lv\b\().16b, \b
+   .endr
+
+   /*
+* ARMv8.2 Crypto Extensions instructions
+*/
+   .macro  eor3, rd, rn, rm, ra
+   .inst   0xce000000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) | (.L\rm << 16)
+   .endm
+
+   .macro  rax1, rd, rn, rm
+   .inst   0xce608c00 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  bcax, rd, rn, rm, ra
+   .inst   0xce200000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) | (.L\rm << 16)
+   .endm
+
+   .macro  xar, rd, rn, rm, imm6
+   .inst   0xce800000 | .L\rd | (.L\rn << 5) | ((\imm6) << 10) | (.L\rm << 16)
+   .endm
+
+   /*
+* sha3_ce_transform(u64 *st, const u8 *data, int blocks, int dg_size)
+*/
+   .text
+ENTRY(sha3_ce_transform)
+   /* load state */
+   add x8, x0, #32
+   ld1 { v0.1d- v3.1d}, [x0]
+   ld1 { v4.1d- v7.1d}, [x8], #32
+   ld1 { v8.1d-v11.1d}, [x8], #32
+   ld1 {v12.1d-v15.1d}, [x8], #32
+   ld1 {v16.1d-v19.1d}, [x8], #32
+   ld1 {v20.1d-v23.1d}, [x8], #32
+   ld1 {v24.1d}, [x8]
+
+0: sub w2, w2, #1
+   mov w8, #24
+   adr_l   x9, .Lsha3_rcon
+
+   /* load input */
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v29.8b-v31.8b}, [x1], #24
+   eor v0.8b, v0.8b, v25.8b
+   eor v1.8b, v1.8b, v26.8b
+   eor v2.8b, v2.8b, v27.8b
+   eor v3.8b, v3.8b, v28.8b
+   eor v4.8b, v4.8b, v29.8b
+   eor v5.8b, v5.8b, v30.8b
+   eor v6.8b, v6.8b, v31.8b
+
+   tbnz    x3, #6, 2f  // SHA3-512
+
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v29.8b-v30.8b}, [x1], #16
+   eor  v7.8b,  v7.8b, v25.8b
+   eor  v8.8b,  v8.8b, v26.8b
+   eor  v9.8b,  v9.8b, v27.8b
+   eor v10.8b, v10.8b, v28.8b
+   eor v11.8b, v11.8b, v29.8b
+   eor v12.8b, v12.8b, v30.8b
+
+   tbnz    x3, #4, 1f  // SHA3-384 or SHA3-224
+
+   // SHA3-256
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   eor v13.8b, v13.8b, v25.8b
+   eor v14.8b, v14.8b, v26.8b
+   eor v15.8b, v15.8b, v27.8b
+   eor v16.8b, v16.8b, v28.8b
+   b   3f
+
+1: tbz x3, #2, 3f  // bit 2 cleared? SHA-384
+
+   // SHA3-224
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v29.8b}, [x1], #8
+   eor v13.8b, v13.8b, v25.8b
+   eor v14.8b, v14.8b, v26.8b
+   eor v15.8b, v15.8b,
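
(The .inst macros at the top of this file exist so that the code still
assembles with binutils that predate the ARMv8.2 SHA3 instructions: the
.Lv<n> symbols map each vector register name to its number, and the macro
ORs that into the fixed opcode. For example, with the rax1 macro shown
above,

	rax1	v0.2d, v1.2d, v2.2d

expands to

	.inst	0xce608c00 | 0 | (1 << 5) | (2 << 16)

i.e. the destination lands in bits 0-4, the first source in bits 5-9 and
the second source in bits 16-20.)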

[PATCH 3/8] crypto/generic: sha3 - simplify code

2018-01-19 Thread Ard Biesheuvel
In preparation of exposing the generic SHA3 implementation to other
versions as a fallback, simplify the code, and remove an inconsistency
in the output handling (endian swabbing rsizw words of state before
writing the output does not make sense)

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 184 +++-
 include/crypto/sha3.h |   1 -
 2 files changed, 59 insertions(+), 126 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index 5fecb609e3be..c7084a24eaf9 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define KECCAK_ROUNDS 24
@@ -146,43 +145,16 @@ static void __attribute__((__optimize__("O3"))) 
keccakf(u64 st[25])
}
 }
 
-static void sha3_init(struct sha3_state *sctx, unsigned int digest_sz)
-{
-   memset(sctx, 0, sizeof(*sctx));
-   sctx->md_len = digest_sz;
-   sctx->rsiz = 200 - 2 * digest_sz;
-   sctx->rsizw = sctx->rsiz / 8;
-}
-
-static int sha3_224_init(struct shash_desc *desc)
+static int sha3_init(struct shash_desc *desc)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
 
-   sha3_init(sctx, SHA3_224_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_256_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
-
-   sha3_init(sctx, SHA3_256_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_384_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
-
-   sha3_init(sctx, SHA3_384_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_512_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
+   sctx->rsiz = 200 - 2 * digest_size;
+   sctx->rsizw = sctx->rsiz / 8;
+   sctx->partial = 0;
 
-   sha3_init(sctx, SHA3_512_DIGEST_SIZE);
+   memset(sctx->st, 0, sizeof(sctx->st));
return 0;
 }
 
@@ -227,6 +199,8 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int i, inlen = sctx->partial;
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
+   __le64 *digest = (__le64 *)out;
 
sctx->buf[inlen++] = 0x06;
memset(sctx->buf + inlen, 0, sctx->rsiz - inlen);
@@ -237,110 +211,70 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
 
keccakf(sctx->st);
 
-   for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] = cpu_to_le64(sctx->st[i]);
+   for (i = 0; i < digest_size / 8; i++)
+   put_unaligned_le64(sctx->st[i], digest++);
 
-   memcpy(out, sctx->st, sctx->md_len);
+   if (digest_size & 4)
+   put_unaligned_le32(sctx->st[i], (__le32 *)digest);
 
memset(sctx, 0, sizeof(*sctx));
return 0;
 }
 
-static struct shash_alg sha3_224 = {
-   .digestsize =   SHA3_224_DIGEST_SIZE,
-   .init   =   sha3_224_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-224",
-   .cra_driver_name =  "sha3-224-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_224_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-};
-
-static struct shash_alg sha3_256 = {
-   .digestsize =   SHA3_256_DIGEST_SIZE,
-   .init   =   sha3_256_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-256",
-   .cra_driver_name =  "sha3-256-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_256_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-};
-
-static struct shash_alg sha3_384 = {
-   .digestsize =   SHA3_384_DIGEST_SIZE,
-   .init   =   sha3_384_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-384",
-   .cra_driver_name =  "sha3-384-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_384_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-}
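
For reference, the rate/capacity arithmetic behind the new sha3_init()
above (the Keccak state is always 200 bytes, i.e. 25 64-bit words):

	rsiz  = 200 - 2 * digest_size
	rsizw = rsiz / 8

	SHA3-224:  200 - 2 * 28 = 144 bytes  (18 words per block)
	SHA3-256:  200 - 2 * 32 = 136 bytes  (17 words per block)
	SHA3-384:  200 - 2 * 48 = 104 bytes  (13 words per block)
	SHA3-512:  200 - 2 * 64 =  72 bytes  ( 9 words per block)

which matches the SHA3_*_BLOCK_SIZE values used as cra_blocksize.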

[PATCH 5/8] crypto/testmgr: sha3 - add new testcases

2018-01-19 Thread Ard Biesheuvel
All current SHA3 test cases are smaller than the SHA3 block size, which
means not all code paths are being exercised. So add a new test case to
each variant, and make one of the existing test cases chunked.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/testmgr.h | 550 
 1 file changed, 550 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..6044f6906bd6 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1052,6 +1052,142 @@ static const struct hash_testvec sha3_224_tv_template[] 
= {
"\xc9\xfd\x55\x74\x49\x44\x79\xba"
"\x5c\x7e\x7a\xb7\x6e\xf2\x64\xea"
"\xd0\xfc\xce\x33",
+   .np = 2,
+   .tap= { 28, 28 },
+   }, {
+   .plaintext = "\x08\x9f\x13\xaa\x41\xd8\x4c\xe3"
+"\x7a\x11\x85\x1c\xb3\x27\xbe\x55"
+"\xec\x60\xf7\x8e\x02\x99\x30\xc7"
+"\x3b\xd2\x69\x00\x74\x0b\xa2\x16"
+"\xad\x44\xdb\x4f\xe6\x7d\x14\x88"
+"\x1f\xb6\x2a\xc1\x58\xef\x63\xfa"
+"\x91\x05\x9c\x33\xca\x3e\xd5\x6c"
+"\x03\x77\x0e\xa5\x19\xb0\x47\xde"
+"\x52\xe9\x80\x17\x8b\x22\xb9\x2d"
+"\xc4\x5b\xf2\x66\xfd\x94\x08\x9f"
+"\x36\xcd\x41\xd8\x6f\x06\x7a\x11"
+"\xa8\x1c\xb3\x4a\xe1\x55\xec\x83"
+"\x1a\x8e\x25\xbc\x30\xc7\x5e\xf5"
+"\x69\x00\x97\x0b\xa2\x39\xd0\x44"
+"\xdb\x72\x09\x7d\x14\xab\x1f\xb6"
+"\x4d\xe4\x58\xef\x86\x1d\x91\x28"
+"\xbf\x33\xca\x61\xf8\x6c\x03\x9a"
+"\x0e\xa5\x3c\xd3\x47\xde\x75\x0c"
+"\x80\x17\xae\x22\xb9\x50\xe7\x5b"
+"\xf2\x89\x20\x94\x2b\xc2\x36\xcd"
+"\x64\xfb\x6f\x06\x9d\x11\xa8\x3f"
+"\xd6\x4a\xe1\x78\x0f\x83\x1a\xb1"
+"\x25\xbc\x53\xea\x5e\xf5\x8c\x00"
+"\x97\x2e\xc5\x39\xd0\x67\xfe\x72"
+"\x09\xa0\x14\xab\x42\xd9\x4d\xe4"
+"\x7b\x12\x86\x1d\xb4\x28\xbf\x56"
+"\xed\x61\xf8\x8f\x03\x9a\x31\xc8"
+"\x3c\xd3\x6a\x01\x75\x0c\xa3\x17"
+"\xae\x45\xdc\x50\xe7\x7e\x15\x89"
+"\x20\xb7\x2b\xc2\x59\xf0\x64\xfb"
+"\x92\x06\x9d\x34\xcb\x3f\xd6\x6d"
+"\x04\x78\x0f\xa6\x1a\xb1\x48\xdf"
+"\x53\xea\x81\x18\x8c\x23\xba\x2e"
+"\xc5\x5c\xf3\x67\xfe\x95\x09\xa0"
+"\x37\xce\x42\xd9\x70\x07\x7b\x12"
+"\xa9\x1d\xb4\x4b\xe2\x56\xed\x84"
+"\x1b\x8f\x26\xbd\x31\xc8\x5f\xf6"
+"\x6a\x01\x98\x0c\xa3\x3a\xd1\x45"
+"\xdc\x73\x0a\x7e\x15\xac\x20\xb7"
+"\x4e\xe5\x59\xf0\x87\x1e\x92\x29"
+"\xc0\x34\xcb\x62\xf9\x6d\x04\x9b"
+"\x0f\xa6\x3d\xd4\x48\xdf\x76\x0d"
+"\x81\x18\xaf\x23\xba\x51\xe8\x5c"
+"\xf3\x8a\x21\x95\x2c\xc3\x37\xce"
+"\x65\xfc\x70\x07\x9e\x12\xa9\x40"
+"\xd7\x4b\xe2\x79\x10\x84\x1b\xb2"
+"\x26\xbd\x54\xeb\x5f\xf6\x8d\x01"
+"\x98\x2f\xc6\x3a\xd1\x68\xff\x73"
+"\x0a\xa1\x15\xac\x43\xda\x4e\xe5"
+"\x7c\x13\x87\x1e\xb5\x29\xc0\x57"
+"\xee\x62\xf9\x90\x04\x9b\x32\xc9"
+"\x3d\xd4\x6b\x02\x76\x0d\xa4\x18"
+"\xaf\x46\xdd\x51\xe8\x7f\x16\x8a"
+"\x21\xb8\x2c\xc3\x5a\xf1\x65\xfc"
+"\x93\x07\x9e\x35\xcc\x40\xd7\x6e"
+"\x05\x79\x10\xa7\

[PATCH 4/8] crypto/generic: sha3 - export init/update/final routines

2018-01-19 Thread Ard Biesheuvel
To allow accelerated implementations to fall back to the generic
routines, e.g., in contexts where a SIMD based implementation is
not allowed to run, expose the generic SHA3 init/update/final
routines to other modules.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 33 +++-
 include/crypto/sha3.h |  5 +++
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index c7084a24eaf9..a965b9d80559 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -145,7 +145,7 @@ static void __attribute__((__optimize__("O3"))) keccakf(u64 
st[25])
}
 }
 
-static int sha3_init(struct shash_desc *desc)
+int crypto_sha3_init(struct shash_desc *desc)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
@@ -157,8 +157,9 @@ static int sha3_init(struct shash_desc *desc)
memset(sctx->st, 0, sizeof(sctx->st));
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_init);
 
-static int sha3_update(struct shash_desc *desc, const u8 *data,
+int crypto_sha3_update(struct shash_desc *desc, const u8 *data,
   unsigned int len)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
@@ -194,8 +195,9 @@ static int sha3_update(struct shash_desc *desc, const u8 
*data,
 
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_update);
 
-static int sha3_final(struct shash_desc *desc, u8 *out)
+int crypto_sha3_final(struct shash_desc *desc, u8 *out)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int i, inlen = sctx->partial;
@@ -220,12 +222,13 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
memset(sctx, 0, sizeof(*sctx));
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_final);
 
 static struct shash_alg algs[] = { {
.digestsize = SHA3_224_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-224",
.base.cra_driver_name   = "sha3-224-generic",
@@ -234,9 +237,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_256_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-256",
.base.cra_driver_name   = "sha3-256-generic",
@@ -245,9 +248,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_384_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-384",
.base.cra_driver_name   = "sha3-384-generic",
@@ -256,9 +259,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_512_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-512",
.base.cra_driver_name   = "sha3-512-generic",
diff --git a/include/crypto/sha3.h b/include/crypto/sha3.h
index 1339dcdbc9b2..080f60c2e6b1 100644
--- a/include/crypto/sha3.h
+++ b/include/crypto/sha3.h
@@ -26,4 +26,9 @@ struct sha3_state {
u8  buf[SHA3_224_BLOCK_SIZE];
 };
 
+int crypto_sha3_init(struct shash_desc *desc);
+int crypto_sha3_update(struct shash_desc *desc, const u8 *data,
+  unsigned int len);
+int crypto_sha3_final(struct shash_desc *desc, u8 *out);
+
 #endif
-- 
2.11.0
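
The fallback these exports enable looks roughly like this in an accelerated
driver (sketch only: may_use_simd(), kernel_neon_begin()/kernel_neon_end()
and the crypto_sha3_* helpers are real interfaces, while the
sha3_neon_update() name and the elided transform are illustrative):

	static int sha3_neon_update(struct shash_desc *desc, const u8 *data,
				    unsigned int len)
	{
		if (!may_use_simd())
			return crypto_sha3_update(desc, data, len);	/* generic C path */

		kernel_neon_begin();
		/* ... feed the data to the accelerated transform ... */
		kernel_neon_end();
		return 0;
	}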



[PATCH 7/8] crypto/arm64: sm3 - new v8.2 Crypto Extensions implementation

2018-01-19 Thread Ard Biesheuvel
Implement the Chinese SM3 secure hash algorithm using the new
special instructions that have been introduced as an optional
extension in ARMv8.2.

Tested-by: Steve Capper <steve.cap...@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig   |   6 +
 arch/arm64/crypto/Makefile  |   3 +
 arch/arm64/crypto/sm3-ce-core.S | 141 
 arch/arm64/crypto/sm3-ce-glue.c |  92 +
 4 files changed, 242 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 3321b2c9a2b5..285c36c7b408 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -41,6 +41,12 @@ config CRYPTO_SHA3_ARM64
select CRYPTO_HASH
select CRYPTO_SHA3
 
+config CRYPTO_SM3_ARM64_CE
+   tristate "SM3 digest algorithm (ARMv8.2 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_HASH
+   select CRYPTO_SM3
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4ca2d146e213..cee9b8d9830b 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -20,6 +20,9 @@ sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o
 sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
+sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sm3-ce-core.S b/arch/arm64/crypto/sm3-ce-core.S
new file mode 100644
index ..27169fe07a68
--- /dev/null
+++ b/arch/arm64/crypto/sm3-ce-core.S
@@ -0,0 +1,141 @@
+/*
+ * sm3-ce-core.S - SM3 secure hash using ARMv8.2 Crypto Extensions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+   .irp    b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
+   .set.Lv\b\().4s, \b
+   .endr
+
+   .macro  sm3partw1, rd, rn, rm
+   .inst   0xce60c000 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sm3partw2, rd, rn, rm
+   .inst   0xce60c400 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sm3ss1, rd, rn, rm, ra
+   .inst   0xce400000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) |
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt1a, rd, rn, rm, imm2
+   .inst   0xce408000 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt1b, rd, rn, rm, imm2
+   .inst   0xce408400 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt2a, rd, rn, rm, imm2
+   .inst   0xce408800 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt2b, rd, rn, rm, imm2
+   .inst   0xce408c00 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  round, ab, s0, t0, t1, i
+   sm3ss1  v5.4s, v8.4s, \t0\().4s, v9.4s
+   shl \t1\().4s, \t0\().4s, #1
+   sri \t1\().4s, \t0\().4s, #31
+   sm3tt1\ab   v8.4s, v5.4s, v10.4s, \i
+   sm3tt2\ab   v9.4s, v5.4s, \s0\().4s, \i
+   .endm
+
+   .macro  qround, ab, s0, s1, s2, s3, s4
+   .ifnb   \s4
+   ext \s4\().16b, \s1\().16b, \s2\().16b, #12
+   ext v6.16b, \s0\().16b, \s1\().16b, #12
+   ext v7.16b, \s2\().16b, \s3\().16b, #8
+   sm3partw1   \s4\().4s, \s0\().4s, \s3\().4s
+   .endif
+
+   eor v10.16b, \s0\().16b, \s1\().16b
+
+   round   \ab, \s0, v11, v12, 0
+   round   \ab, \s0, v12, v11, 1
+   round   \ab, \s0, v11, v12, 2
+   round   \ab, \s0, v12, v11, 3
+
+   .ifnb   \s4
+   sm3partw2   \s4\().4s, v7.4s, v6.4s
+   .endif
+   .endm
+
+   /*
+* void sm3_ce_transform(struct sm3_state *sst, u8 const *src,
+*   int blocks)
+*/
+   .text
+ENTRY(sm3_ce_transform)
+   /* load state */
+   ld1 {v8.4s-v9.4s}, [x0]
+   rev64   v8.4s, v8.4s
+   rev64   v9.4s, v9.4s
+   ext v8.16b, v8.16b, v8.16b, #8
+   ext v9.16b, v9.16b, v9.16b, #8
+
+   adr_l   x8, .Lt
+   ldp  

[PATCH 8/8] crypto/arm64: sha512 - fix/improve new v8.2 Crypto Extensions code

2018-01-19 Thread Ard Biesheuvel
Add a missing symbol export that prevents this code from being built
as a module. Also, move the round constant table to the .rodata section,
and use a more optimized version of the core transform.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha512-ce-core.S | 145 ++--
 arch/arm64/crypto/sha512-glue.c|   1 +
 2 files changed, 72 insertions(+), 74 deletions(-)

diff --git a/arch/arm64/crypto/sha512-ce-core.S 
b/arch/arm64/crypto/sha512-ce-core.S
index 6c562f8df0b0..7f3bca5c59a2 100644
--- a/arch/arm64/crypto/sha512-ce-core.S
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -12,10 +12,7 @@
 #include <linux/linkage.h>
 #include <asm/assembler.h>
 
-   //
-   // Temporary - for testing only. binutils has no support for these yet
-   //
-   .irp
b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+   .irp    b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
.set.Lq\b, \b
.set.Lv\b\().2d, \b
.endr
@@ -36,12 +33,10 @@
.inst   0xce608800 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
.endm
 
-   .text
-   .arch   armv8-a+crypto
-
/*
 * The SHA-512 round constants
 */
+   .section".rodata", "a"
.align  4
 .Lsha512_rcon:
.quad   0x428a2f98d728ae22, 0x7137449123ef65cd
@@ -87,20 +82,20 @@
 
.macro  dround, i0, i1, i2, i3, i4, rc0, rc1, in0, in1, in2, 
in3, in4
.ifnb   \rc1
-   ld1 {v\rc1\().2d}, [x3], #16
+   ld1 {v\rc1\().2d}, [x4], #16
.endif
-   add v\rc0\().2d, v\rc0\().2d, v\in0\().2d
+   add v5.2d, v\rc0\().2d, v\in0\().2d
ext v6.16b, v\i2\().16b, v\i3\().16b, #8
-   ext v\rc0\().16b, v\rc0\().16b, v\rc0\().16b, #8
+   ext v5.16b, v5.16b, v5.16b, #8
ext v7.16b, v\i1\().16b, v\i2\().16b, #8
-   add v\i3\().2d, v\i3\().2d, v\rc0\().2d
+   add v\i3\().2d, v\i3\().2d, v5.2d
.ifnb   \in1
-   ext v10.16b, v\in3\().16b, v\in4\().16b, #8
+   ext v5.16b, v\in3\().16b, v\in4\().16b, #8
sha512su0   v\in0\().2d, v\in1\().2d
.endif
sha512h q\i3, q6, v7.2d
.ifnb   \in1
-   sha512su1   v\in0\().2d, v\in2\().2d, v10.2d
+   sha512su1   v\in0\().2d, v\in2\().2d, v5.2d
.endif
add v\i4\().2d, v\i1\().2d, v\i3\().2d
sha512h2q\i3, q\i1, v\i0\().2d
@@ -110,18 +105,20 @@
 * void sha512_ce_transform(struct sha512_state *sst, u8 const *src,
 *int blocks)
 */
+   .text
 ENTRY(sha512_ce_transform)
/* load state */
-   ld1 {v20.2d-v23.2d}, [x0]
+   ld1 {v8.2d-v11.2d}, [x0]
+
+   /* load first 4 round constants */
+   adr_l   x3, .Lsha512_rcon
+   ld1 {v20.2d-v23.2d}, [x3], #64
 
/* load input */
 0: ld1 {v12.2d-v15.2d}, [x1], #64
ld1 {v16.2d-v19.2d}, [x1], #64
sub w2, w2, #1
 
-   /* load round constants */
-   adr x3, .Lsha512_rcon
-
 CPU_LE(rev64   v12.16b, v12.16b)
 CPU_LE(rev64   v13.16b, v13.16b)
 CPU_LE(rev64   v14.16b, v14.16b)
@@ -131,12 +128,12 @@ CPU_LE(   rev64   v17.16b, v17.16b)
 CPU_LE(rev64   v18.16b, v18.16b)
 CPU_LE(rev64   v19.16b, v19.16b)
 
-   ld1 {v8.2d}, [x3], #16
+   mov x4, x3  // rc pointer
 
-   mov v0.16b, v20.16b
-   mov v1.16b, v21.16b
-   mov v2.16b, v22.16b
-   mov v3.16b, v23.16b
+   mov v0.16b, v8.16b
+   mov v1.16b, v9.16b
+   mov v2.16b, v10.16b
+   mov v3.16b, v11.16b
 
// v0  ab  cd  --  ef  gh  ab
// v1  cd  --  ef  gh  ab  cd
@@ -144,64 +141,64 @@ CPU_LE(   rev64   v19.16b, v19.16b)
// v3  gh  ab  cd  --  ef  gh
// v4  --  ef  gh  ab  cd  --
 
-   dround  0, 1, 2, 3, 4, 8, 9, 12, 13, 19, 16, 17
-   dround  3, 0, 4, 2, 1, 9, 8, 13, 14, 12, 17, 18
-   dround  2, 3, 1, 4, 0, 8, 9, 14, 15, 13, 18, 19
-   dround  4, 2, 0, 1, 3, 9, 8, 15, 16, 14, 19, 12
-   dround  1, 4, 3, 0, 2, 8, 9, 16, 17, 15, 12, 13
-
-   dround  0, 1, 2, 3, 4, 9, 8, 17, 18, 16, 13, 14
-   dround  3, 0, 4, 2, 1, 8, 9, 18, 19, 17, 14, 15
-   dround  2, 3, 1, 4, 0, 9, 8, 19, 12, 18, 15, 16
-   dround  4, 2, 0, 1, 3, 8, 9, 12, 13, 1

[PATCH 1/8] crypto/generic: sha3 - fixes for alignment and big endian operation

2018-01-19 Thread Ard Biesheuvel
Ensure that the input is byte swabbed before injecting it into the
SHA3 transform. Use the get_unaligned() accessor for this so that
we don't perform unaligned access inadvertently on architectures
that do not support that.

Cc: <sta...@vger.kernel.org>
Fixes: 53964b9ee63b7075 ("crypto: sha3 - Add SHA-3 hash algorithm")
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index 7e8ed96236ce..a68be626017c 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -18,6 +18,7 @@
 #include <linux/types.h>
 #include <crypto/sha3.h>
 #include <asm/byteorder.h>
+#include <asm/unaligned.h>
 
 #define KECCAK_ROUNDS 24
 
@@ -149,7 +150,7 @@ static int sha3_update(struct shash_desc *desc, const u8 
*data,
unsigned int i;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) src)[i];
+   sctx->st[i] ^= get_unaligned_le64(src + 8 * i);
keccakf(sctx->st);
 
done += sctx->rsiz;
@@ -174,7 +175,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
sctx->buf[sctx->rsiz - 1] |= 0x80;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) sctx->buf)[i];
+   sctx->st[i] ^= get_unaligned_le64(sctx->buf + 8 * i);
 
keccakf(sctx->st);
 
-- 
2.11.0



[PATCH 2/8] crypto/generic: sha3: rewrite KECCAK transform to help the compiler optimize

2018-01-19 Thread Ard Biesheuvel
The way the KECCAK transform is currently coded involves many references
into the state array using indexes that are calculated at runtime using
simple but non-trivial arithmetic. This forces the compiler to treat the
state matrix as an array in memory rather than keep it in registers,
which results in poor performance.

So instead, let's rephrase the algorithm using fixed array indexes only.
This helps the compiler keep the state matrix in registers, resulting
in the following speedup (SHA3-256 performance in cycles per byte):

                                        before    after    speedup
  Intel Core i7 @ 2.0 GHz (2.9 turbo)   100.6     35.7      2.8x
  Cortex-A57 @ 2.0 GHz (64-bit mode)    101.6     12.7      8.0x
  Cortex-A53 @ 1.0 GHz                  224.4     15.8     14.2x
  Cortex-A57 @ 2.0 GHz (32-bit mode)    201.8     63.0      3.2x
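
As a minimal illustration of the idea (a standalone sketch, not part of the
patch): with an index computed at run time the compiler has to keep the
state addressable in memory, whereas constant indexes let it assign each
word of the state its own register.

#include <stdint.h>

/* index depends on 'col' at run time: 'st' must stay in memory */
static uint64_t column_parity_indexed(const uint64_t st[25], int col)
{
	uint64_t bc = 0;
	int j;

	for (j = 0; j < 25; j += 5)
		bc ^= st[j + col];
	return bc;
}

/* constant indexes: the five words can live in registers */
static uint64_t column_parity_col0(const uint64_t st[25])
{
	return st[0] ^ st[5] ^ st[10] ^ st[15] ^ st[20];
}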

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 134 ++--
 1 file changed, 96 insertions(+), 38 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index a68be626017c..5fecb609e3be 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -5,6 +5,7 @@
  * http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf
  *
  * SHA-3 code by Jeff Garzik <j...@garzik.org>
+ *       Ard Biesheuvel <ard.biesheu...@linaro.org>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the Free
@@ -22,8 +23,6 @@
 
 #define KECCAK_ROUNDS 24
 
-#define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))
-
 static const u64 keccakf_rndc[24] = {
 	0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
 	0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
@@ -35,53 +34,112 @@ static const u64 keccakf_rndc[24] = {
 	0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL
 };
 
-static const int keccakf_rotc[24] = {
-   1,  3,  6,  10, 15, 21, 28, 36, 45, 55, 2,  14,
-   27, 41, 56, 8,  25, 43, 62, 18, 39, 61, 20, 44
-};
-
-static const int keccakf_piln[24] = {
-   10, 7,  11, 17, 18, 3, 5,  16, 8,  21, 24, 4,
-   15, 23, 19, 13, 12, 2, 20, 14, 22, 9,  6,  1
-};
-
 /* update the state with given number of rounds */
 
-static void keccakf(u64 st[25])
+static void __attribute__((__optimize__("O3"))) keccakf(u64 st[25])
 {
-   int i, j, round;
-   u64 t, bc[5];
+   u64 t[5], tt, bc[5];
+   int round;
 
for (round = 0; round < KECCAK_ROUNDS; round++) {
 
/* Theta */
-   for (i = 0; i < 5; i++)
-   bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15]
-   ^ st[i + 20];
-
-   for (i = 0; i < 5; i++) {
-   t = bc[(i + 4) % 5] ^ ROTL64(bc[(i + 1) % 5], 1);
-   for (j = 0; j < 25; j += 5)
-   st[j + i] ^= t;
-   }
+   bc[0] = st[0] ^ st[5] ^ st[10] ^ st[15] ^ st[20];
+   bc[1] = st[1] ^ st[6] ^ st[11] ^ st[16] ^ st[21];
+   bc[2] = st[2] ^ st[7] ^ st[12] ^ st[17] ^ st[22];
+   bc[3] = st[3] ^ st[8] ^ st[13] ^ st[18] ^ st[23];
+   bc[4] = st[4] ^ st[9] ^ st[14] ^ st[19] ^ st[24];
+
+   t[0] = bc[4] ^ rol64(bc[1], 1);
+   t[1] = bc[0] ^ rol64(bc[2], 1);
+   t[2] = bc[1] ^ rol64(bc[3], 1);
+   t[3] = bc[2] ^ rol64(bc[4], 1);
+   t[4] = bc[3] ^ rol64(bc[0], 1);
+
+   st[0] ^= t[0];
 
/* Rho Pi */
-   t = st[1];
-   for (i = 0; i < 24; i++) {
-   j = keccakf_piln[i];
-   bc[0] = st[j];
-   st[j] = ROTL64(t, keccakf_rotc[i]);
-   t = bc[0];
-   }
+   tt = st[1];
+   st[ 1] = rol64(st[ 6] ^ t[1], 44);
+   st[ 6] = rol64(st[ 9] ^ t[4], 20);
+   st[ 9] = rol64(st[22] ^ t[2], 61);
+   st[22] = rol64(st[14] ^ t[4], 39);
+   st[14] = rol64(st[20] ^ t[0], 18);
+   st[20] = rol64(st[ 2] ^ t[2], 62);
+   st[ 2] = rol64(st[12] ^ t[2], 43);
+   st[12] = rol64(st[13] ^ t[3], 25);
+   st[13] = rol64(st[19] ^ t[4],  8);
+   st[19] = rol64(st[23] ^ t[3], 56);
+   st[23] = rol64(st[15] ^ t[0], 41);
+   st[15] = rol64(st[ 4] ^ t[4], 27);
+   st[ 4] = rol64(st[24] ^ t[4], 14);
+   st[24] = rol64(st[21] ^ t[1],  2);
+   st[21] = rol64(st[ 8] ^ t[3], 55);
+   st[ 8] = rol64(st[16] ^ t[1], 45);
+   st[16] = rol64(st[ 5] ^ t[0], 36);
+   st[ 5] = rol64(st[ 3] ^ t[3], 28);
+   st[ 3] = rol64(st[18] ^ t[3]

[PATCH 0/8] crypto: arm64+generic - SHA3/SHA-512/SM-3 roundup

2018-01-19 Thread Ard Biesheuvel
This supersedes all outstanding patches from me related to SHA-3, SHA-512
or SM-3.

- fix a correctness issue in the SHA-3 code (#1) and a performance issue (#2),
  the first one is definitely a -stable candidate, the second one potentially
  as well
- patches #3 and #4 make the generic SHA-3 code reusable as a fallback for the
  accelerated code introduced in #6
- patch #5 adds some SHA-3 test cases
- patch #6 implements SHA-3 using special arm64 instructions
- patch #7 implements the Chinese SM3 secure hash algorithm using special
  arm64 instructions
- patch #8 contains some fixes for the recently queued SHA-512 arm64 code.

Ard Biesheuvel (8):
  crypto/generic: sha3 - fixes for alignment and big endian operation
  crypto/generic: sha3: rewrite KECCAK transform to help the compiler
optimize
  crypto/generic: sha3 - simplify code
  crypto/generic: sha3 - export init/update/final routines
  crypto/testmgr: sha3 - add new testcases
  crypto/arm64: sha3 - new v8.2 Crypto Extensions implementation
  crypto/arm64: sm3 - new v8.2 Crypto Extensions implementation
  crypto/arm64: sha512 - fix/improve new v8.2 Crypto Extensions code

 arch/arm64/crypto/Kconfig  |  12 +
 arch/arm64/crypto/Makefile |   6 +
 arch/arm64/crypto/sha3-ce-core.S   | 210 
 arch/arm64/crypto/sha3-ce-glue.c   | 161 ++
 arch/arm64/crypto/sha512-ce-core.S | 145 +++---
 arch/arm64/crypto/sha512-glue.c|   1 +
 arch/arm64/crypto/sm3-ce-core.S| 141 +
 arch/arm64/crypto/sm3-ce-glue.c|  92 
 crypto/sha3_generic.c  | 332 ++--
 crypto/testmgr.h   | 550 
 include/crypto/sha3.h  |   6 +-
 11 files changed, 1413 insertions(+), 243 deletions(-)
 create mode 100644 arch/arm64/crypto/sha3-ce-core.S
 create mode 100644 arch/arm64/crypto/sha3-ce-glue.c
 create mode 100644 arch/arm64/crypto/sm3-ce-core.S
 create mode 100644 arch/arm64/crypto/sm3-ce-glue.c

-- 
2.11.0



Re: [PATCH v2 0/3] sha3 fixes and new implementation for arm64

2018-01-18 Thread Ard Biesheuvel
On 14 January 2018 at 16:41, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> Add an implementation of SHA3 to arm64 using the new special instructions,
> and another one using scalar instructions but coded in assembler (#2)
>
> In preparation of that, fix a bug in the SHA3 (#1) and add some new test
> vectors to get better test coverage (#3).
>
> v2: Drop generic SHA3 as a fallback for the arm64 module. Instead, provide
> a special arm64 version to use as a fallback when the instructions are
> not available or when executing in a context that does not allow SIMD
>
> Drop patches that simplify the generic SHA3 and make it reusable by
> other modules.
>
> Ard Biesheuvel (3):
>   crypto/generic: sha3 - fixes for alignment and big endian operation
>   crypto/arm64: sha3 - new scalar + v8.2 Crypto Extensions
> implementation
>   crypto/testmgr: sha3 - add new testcases
>
>  arch/arm64/crypto/Kconfig   |   4 +
>  arch/arm64/crypto/Makefile  |   3 +
>  arch/arm64/crypto/sha3-arm64-core.S | 512 ++
>  arch/arm64/crypto/sha3-arm64-glue.c | 192 +++
>  crypto/sha3_generic.c   |   5 +-
>  crypto/testmgr.h| 550 
>  6 files changed, 1264 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/crypto/sha3-arm64-core.S
>  create mode 100644 arch/arm64/crypto/sha3-arm64-glue.c
>

Herbert,

Could you hold off on the SHA-3 patches for a little while? With the
performance fix for the generic code, it may no longer be worthwhile
to have a special arm64 implementation as well. I will respin a series
containing everything I think is needed.

The SM3 patch is independent, and is good to go IMO (with Steve's Tested-by)

Thanks,
Ard.


Re: [PATCH v2] [v2] crypto: aes-generic - fix aes-generic regression on powerpc

2018-01-18 Thread Ard Biesheuvel
On 15 January 2018 at 16:07, Arnd Bergmann <a...@arndb.de> wrote:
> My last bugfix added -Os on the command line, which unfortunately caused
> a build regression on powerpc in some configurations.
>
> I've done some more analysis of the original problem and found slightly
> different workaround that avoids this regression and also results in
> better performance on gcc-7.0: -fcode-hoisting is an optimization step
> that got added in gcc-7 and that for all gcc-7 versions causes worse
> performance.
>
> This disables -fcode-hoisting on all compilers that understand the option.
> For gcc-7.1 and 7.2 I found the same performance as my previous patch
> (using -Os), in gcc-7.0 it was even better. On gcc-8 I could see no
> change in performance from this patch. In theory, code hoisting should
> not be able to make things better for the AES cipher, so leaving it
> disabled for gcc-8 only serves to simplify the Makefile change.
>
> Reported-by: kbuild test robot <fengguang...@intel.com>
> Link: https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg30418.html
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
> Fixes: 148b974deea9 ("crypto: aes-generic - build with -Os on gcc-7+")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
> v2: fix a typo in the Makefile
> ---
>  crypto/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/crypto/Makefile b/crypto/Makefile
> index daa69360e054..cdbc03b35510 100644
> --- a/crypto/Makefile
> +++ b/crypto/Makefile
> @@ -99,7 +99,7 @@ obj-$(CONFIG_CRYPTO_TWOFISH_COMMON) += twofish_common.o
>  obj-$(CONFIG_CRYPTO_SERPENT) += serpent_generic.o
>  CFLAGS_serpent_generic.o := $(call cc-option,-fsched-pressure)  # 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
>  obj-$(CONFIG_CRYPTO_AES) += aes_generic.o
> -CFLAGS_aes_generic.o := $(call cc-ifversion, -ge, 0701, -Os) # 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> +CFLAGS_aes_generic.o := $(call cc-option,-fno-code-hoisting) # 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
>  obj-$(CONFIG_CRYPTO_AES_TI) += aes_ti.o
>  obj-$(CONFIG_CRYPTO_CAMELLIA) += camellia_generic.o
>  obj-$(CONFIG_CRYPTO_CAST_COMMON) += cast_common.o
> --
> 2.9.0
>


Re: [PATCH 0/7] arm64: move literal data into .rodata section

2018-01-18 Thread Ard Biesheuvel
On 18 January 2018 at 11:41, Herbert Xu <herb...@gondor.apana.org.au> wrote:
> On Wed, Jan 10, 2018 at 12:11:35PM +0000, Ard Biesheuvel wrote:
>> Prevent inadvertently creating speculative gadgets by moving literal data
>> into the .rodata section.
>>
>> Patch #1 enables this for C code, by reverting a change that disables the
>> GCC feature implementing this. Note that this conflicts with the mitigation
>> of erratum #843419 for Cortex-A53.
>
> Ard, which tree is this supposed to go through?
>

Hi Herbert,

I am going to drop that first patch, the remaining 6 patches can go
through the crypto tree as they are independent.

Thanks,
Ard.


Re: [PATCH 0/5] sha3 fixes and new implementation for arm64

2018-01-16 Thread Ard Biesheuvel
On 16 January 2018 at 08:41, Steve Capper <steve.cap...@arm.com> wrote:
> On Fri, Jan 12, 2018 at 03:13:56PM +0000, Ard Biesheuvel wrote:
>> On 12 January 2018 at 13:15, Ard Biesheuvel <ard.biesheu...@linaro.org> 
>> wrote:
>> > Add an implementation of SHA3 to arm64 using the new special instructions 
>> > (#4)
>> >
>> > In preparation of that, fix a bug in the SHA3 and refactor it a bit so it
>> > can serve as a fallback for the other code. Also, add some new test vectors
>> > to get better test coverage.
>> >
>> > Ard Biesheuvel (5):
>> >   crypto/generic: sha3 - fixes for alignment and big endian operation
>> >   crypto/generic: sha3 - simplify code
>> >   crypto/generic: sha3 - export init/update/final routines
>> >   crypto/arm64: sha3 - new implementation based on special instructions
>>
>> Forgot to mention: this is an RFT for patch #4, as it has not been
>> validated against a real implementation, only against my own QEMU
>> code.
>
> Hi Ard,
> I have tested this patch set applied to 4.15-rc7 running in a model.
>
> I used the following tcrypt modes:
> 48, 49, 50, 51, 111, 112, 113, 114, 187, 188, 322, 323, 324, 325, 418,
> 419, 420 and 421.
>
> Also, I added some logic to double check that sha3_ce_transform(.)
> was being called rather than sha3_scalar_transform(.).
> (Because both the scalar and ce code paths are contained in the
> sha3-x-arm64 drivers).
>
> So, please feel free to add for the series:
> Tested-by: Steve Capper <steve.cap...@arm.com>
>

Thanks Steve!


[RFT PATCH] crypto: arm64 - implement SM3 secure hash using special instructions

2018-01-16 Thread Ard Biesheuvel
Implement the Chinese SM3 secure hash algorithm using the new
special instructions that have been introduced as an optional
extension in ARMv8.2.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig   |   5 ++
 arch/arm64/crypto/Makefile  |   3 +
 arch/arm64/crypto/sm3-ce-core.S | 142 
 arch/arm64/crypto/sm3-ce-glue.c |  92 ++
 4 files changed, 242 insertions(+)
 create mode 100644 arch/arm64/crypto/sm3-ce-core.S
 create mode 100644 arch/arm64/crypto/sm3-ce-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 71293e049a5d..225c3842644c 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -105,4 +105,9 @@ config CRYPTO_AES_ARM64_BS
select CRYPTO_AES_ARM64
select CRYPTO_SIMD
 
+config CRYPTO_SM3_ARM64_CE
+   tristate "SM3 digest algorithm (ARMv8.2 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_HASH
+
 endif
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 267764473ef6..989d6e51f6c9 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -56,6 +56,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
 aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
 
+obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
+sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
+
 AFLAGS_aes-ce.o:= -DINTERLEAVE=4
 AFLAGS_aes-neon.o  := -DINTERLEAVE=4
 
diff --git a/arch/arm64/crypto/sm3-ce-core.S b/arch/arm64/crypto/sm3-ce-core.S
new file mode 100644
index ..961d01764886
--- /dev/null
+++ b/arch/arm64/crypto/sm3-ce-core.S
@@ -0,0 +1,142 @@
+/*
+ * sm3-ce-core.S - SM3 secure hash using ARMv8.2 Crypto Extensions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+   .irp    b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+   .set.Lv\b\().4s, \b
+   .endr
+
+   .macro  sm3partw1, rd, rn, rm
+   .inst   0xce60c000 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sm3partw2, rd, rn, rm
+   .inst   0xce60c400 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sm3ss1, rd, rn, rm, ra
+   .inst   0xce400000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) |
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt1a, rd, rn, rm, imm2
+   .inst   0xce408000 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt1b, rd, rn, rm, imm2
+   .inst   0xce408400 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt2a, rd, rn, rm, imm2
+   .inst   0xce408800 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  sm3tt2b, rd, rn, rm, imm2
+   .inst   0xce408c00 | .L\rd | (.L\rn << 5) | ((\imm2) << 12) | 
(.L\rm << 16)
+   .endm
+
+   .macro  round, ab, s0, t0, t1, i
+   sm3ss1  v10.4s, v8.4s, \t0\().4s, v9.4s
+   shl \t1\().4s, \t0\().4s, #1
+   sri \t1\().4s, \t0\().4s, #31
+   sm3tt1\ab   v8.4s, v10.4s, v15.4s, \i
+   sm3tt2\ab   v9.4s, v10.4s, \s0\().4s, \i
+   .endm
+
+   .macro  qround, ab, s0, s1, s2, s3, s4
+   .ifnb   \s4
+   ext \s4\().16b, \s1\().16b, \s2\().16b, #12
+   ext v13.16b, \s0\().16b, \s1\().16b, #12
+   ext v14.16b, \s2\().16b, \s3\().16b, #8
+   sm3partw1   \s4\().4s, \s0\().4s, \s3\().4s
+   .endif
+
+   eor v15.16b, \s0\().16b, \s1\().16b
+
+   round   \ab, \s0, v11, v12, 0
+   round   \ab, \s0, v12, v11, 1
+   round   \ab, \s0, v11, v12, 2
+   round   \ab, \s0, v12, v11, 3
+
+   .ifnb   \s4
+   sm3partw2   \s4\().4s, v14.4s, v13.4s
+   .endif
+   .endm
+
+   /*
+* void sm3_ce_transform(struct sm3_state *sst, u8 const *src,
+*   int blocks)
+*/
+   .text
+ENTRY(sm3_ce_transform)
+   /* load state */
+   ld1 {v8.4s-v9.4s}, [x0]
+   rev64   v8.4s, v8.4s
+   rev64   v9.4s, v9.4s
+   ext v8.16b, v8.16b, v8.16b, #8
+   ext v9.16b, v9.16b, v9.16b, #8
+
+   adrp    x8, .Lt
+   ldr s16, [x8, :lo12:.Lt]
+   ldr s

Re: [RFT PATCH] crypto: arm64 - implement SHA-512 using special instructions

2018-01-16 Thread Ard Biesheuvel
On 16 January 2018 at 08:16, Steve Capper <steve.cap...@arm.com> wrote:
> On Tue, Jan 09, 2018 at 06:23:02PM +0000, Ard Biesheuvel wrote:
>> Implement the SHA-512 using the new special instructions that have
>> been introduced as an optional extension in ARMv8.2.
>
> Hi Ard,
> I have tested this applied on top of 4.15-rc7 running in a model.
>
> For sha512-ce, I verified that tcrypt successfully passed tests for modes:
> 12, 104, 189, 190, 306, 406 and 424.
> (and I double checked that sha512-ce was being used).
>
> Similarly for sha384-ce, I tested the following modes:
> 11, 103, 187, 188, 305 and 405.
>
> Also, I had:
> CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=n
>
> So FWIW, please feel free to add:
> Tested-by: Steve Capper <steve.cap...@arm.com>
>

Excellent! Thanks a lot Steve.


[PATCH] crypto/generic: sha3: rewrite KECCAK transform to help the GCC optimizer

2018-01-15 Thread Ard Biesheuvel
The way the KECCAK transform is currently coded involves many references
into the state array using indexes that are calculated at runtime using
simple but non-trivial arithmetic. This forces the compiler to treat the
state matrix as an array in memory rather than keep it in registers,
which results in poor performance.

So instead, let's rephrase the algorithm using fixed array indexes only.
This helps the compiler keep the state matrix in registers, resulting
in the following speedup (SHA3-256 performance in cycles per byte):

                                        before    after    speedup
  Intel Core i7 @ 2.0 GHz (2.9 turbo)   100.6     35.7      2.8x
  Cortex-A57 @ 2.0 GHz (64-bit mode)    101.6     12.7      8.0x
  Cortex-A53 @ 1.0 GHz                  224.4     15.8     14.2x
  Cortex-A57 @ 2.0 GHz (32-bit mode)    201.8     63.0      3.2x

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
Raw tcrypt performance numbers after the patch.

 crypto/sha3_generic.c | 134 ++--
 1 file changed, 96 insertions(+), 38 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index a68be626017c..5fecb609e3be 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -5,6 +5,7 @@
  * http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf
  *
  * SHA-3 code by Jeff Garzik <j...@garzik.org>
+ *       Ard Biesheuvel <ard.biesheu...@linaro.org>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the Free
@@ -22,8 +23,6 @@
 
 #define KECCAK_ROUNDS 24
 
-#define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))
-
 static const u64 keccakf_rndc[24] = {
 	0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
 	0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
@@ -35,53 +34,112 @@ static const u64 keccakf_rndc[24] = {
 	0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL
 };
 
-static const int keccakf_rotc[24] = {
-   1,  3,  6,  10, 15, 21, 28, 36, 45, 55, 2,  14,
-   27, 41, 56, 8,  25, 43, 62, 18, 39, 61, 20, 44
-};
-
-static const int keccakf_piln[24] = {
-   10, 7,  11, 17, 18, 3, 5,  16, 8,  21, 24, 4,
-   15, 23, 19, 13, 12, 2, 20, 14, 22, 9,  6,  1
-};
-
 /* update the state with given number of rounds */
 
-static void keccakf(u64 st[25])
+static void __attribute__((__optimize__("O3"))) keccakf(u64 st[25])
 {
-   int i, j, round;
-   u64 t, bc[5];
+   u64 t[5], tt, bc[5];
+   int round;
 
for (round = 0; round < KECCAK_ROUNDS; round++) {
 
/* Theta */
-   for (i = 0; i < 5; i++)
-   bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15]
-   ^ st[i + 20];
-
-   for (i = 0; i < 5; i++) {
-   t = bc[(i + 4) % 5] ^ ROTL64(bc[(i + 1) % 5], 1);
-   for (j = 0; j < 25; j += 5)
-   st[j + i] ^= t;
-   }
+   bc[0] = st[0] ^ st[5] ^ st[10] ^ st[15] ^ st[20];
+   bc[1] = st[1] ^ st[6] ^ st[11] ^ st[16] ^ st[21];
+   bc[2] = st[2] ^ st[7] ^ st[12] ^ st[17] ^ st[22];
+   bc[3] = st[3] ^ st[8] ^ st[13] ^ st[18] ^ st[23];
+   bc[4] = st[4] ^ st[9] ^ st[14] ^ st[19] ^ st[24];
+
+   t[0] = bc[4] ^ rol64(bc[1], 1);
+   t[1] = bc[0] ^ rol64(bc[2], 1);
+   t[2] = bc[1] ^ rol64(bc[3], 1);
+   t[3] = bc[2] ^ rol64(bc[4], 1);
+   t[4] = bc[3] ^ rol64(bc[0], 1);
+
+   st[0] ^= t[0];
 
/* Rho Pi */
-   t = st[1];
-   for (i = 0; i < 24; i++) {
-   j = keccakf_piln[i];
-   bc[0] = st[j];
-   st[j] = ROTL64(t, keccakf_rotc[i]);
-   t = bc[0];
-   }
+   tt = st[1];
+   st[ 1] = rol64(st[ 6] ^ t[1], 44);
+   st[ 6] = rol64(st[ 9] ^ t[4], 20);
+   st[ 9] = rol64(st[22] ^ t[2], 61);
+   st[22] = rol64(st[14] ^ t[4], 39);
+   st[14] = rol64(st[20] ^ t[0], 18);
+   st[20] = rol64(st[ 2] ^ t[2], 62);
+   st[ 2] = rol64(st[12] ^ t[2], 43);
+   st[12] = rol64(st[13] ^ t[3], 25);
+   st[13] = rol64(st[19] ^ t[4],  8);
+   st[19] = rol64(st[23] ^ t[3], 56);
+   st[23] = rol64(st[15] ^ t[0], 41);
+   st[15] = rol64(st[ 4] ^ t[4], 27);
+   st[ 4] = rol64(st[24] ^ t[4], 14);
+   st[24] = rol64(st[21] ^ t[1],  2);
+   st[21] = rol64(st[ 8] ^ t[3], 55);
+   st[ 8] = rol64(st[16] ^ t[1], 45);
+   st[16] = rol64(st[ 5] ^ t[0], 36);
+   st[ 5] = rol64(st[ 3] ^ t[3]

Re: [PATCH v2 1/3] crypto/generic: sha3 - fixes for alignment and big endian operation

2018-01-15 Thread Ard Biesheuvel
On 15 January 2018 at 05:53, Chris Moore <mo...@free.fr> wrote:
> Hi,
>
> Le 14/01/2018 à 17:41, Ard Biesheuvel a écrit :
>>
>> Ensure that the input is byte swabbed before injecting it into the
>
>
> Nitpick : s/swabbed/swapped/
>

Thanks Chris - byte swapping is often referred to as swabbing, but I
guess 'byte swabbing' is redundant regardless.

>> SHA3 transform. Use the get_unaligned() accessor for this so that
>> we don't perform unaligned access inadvertently on architectures
>> that do not support that.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>
>
> Cheers,
> Chris
>


[PATCH v2 1/3] crypto/generic: sha3 - fixes for alignment and big endian operation

2018-01-14 Thread Ard Biesheuvel
Ensure that the input is byte swabbed before injecting it into the
SHA3 transform. Use the get_unaligned() accessor for this so that
we don't perform unaligned access inadvertently on architectures
that do not support that.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index 7e8ed96236ce..a68be626017c 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -18,6 +18,7 @@
 #include <linux/types.h>
 #include <crypto/sha3.h>
 #include <asm/byteorder.h>
+#include <asm/unaligned.h>
 
 #define KECCAK_ROUNDS 24
 
@@ -149,7 +150,7 @@ static int sha3_update(struct shash_desc *desc, const u8 
*data,
unsigned int i;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) src)[i];
+   sctx->st[i] ^= get_unaligned_le64(src + 8 * i);
keccakf(sctx->st);
 
done += sctx->rsiz;
@@ -174,7 +175,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
sctx->buf[sctx->rsiz - 1] |= 0x80;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) sctx->buf)[i];
+   sctx->st[i] ^= get_unaligned_le64(sctx->buf + 8 * i);
 
keccakf(sctx->st);
 
-- 
2.11.0



[PATCH v2 3/3] crypto/testmgr: sha3 - add new testcases

2018-01-14 Thread Ard Biesheuvel
All current SHA3 test cases are smaller than the SHA3 block size, which
means not all code paths are being exercised. So add a new test case to
each variant, and make one of the existing test cases chunked.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/testmgr.h | 550 
 1 file changed, 550 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..6044f6906bd6 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1052,6 +1052,142 @@ static const struct hash_testvec sha3_224_tv_template[] 
= {
"\xc9\xfd\x55\x74\x49\x44\x79\xba"
"\x5c\x7e\x7a\xb7\x6e\xf2\x64\xea"
"\xd0\xfc\xce\x33",
+   .np = 2,
+   .tap= { 28, 28 },
+   }, {
+   .plaintext = "\x08\x9f\x13\xaa\x41\xd8\x4c\xe3"
+"\x7a\x11\x85\x1c\xb3\x27\xbe\x55"
+"\xec\x60\xf7\x8e\x02\x99\x30\xc7"
+"\x3b\xd2\x69\x00\x74\x0b\xa2\x16"
+"\xad\x44\xdb\x4f\xe6\x7d\x14\x88"
+"\x1f\xb6\x2a\xc1\x58\xef\x63\xfa"
+"\x91\x05\x9c\x33\xca\x3e\xd5\x6c"
+"\x03\x77\x0e\xa5\x19\xb0\x47\xde"
+"\x52\xe9\x80\x17\x8b\x22\xb9\x2d"
+"\xc4\x5b\xf2\x66\xfd\x94\x08\x9f"
+"\x36\xcd\x41\xd8\x6f\x06\x7a\x11"
+"\xa8\x1c\xb3\x4a\xe1\x55\xec\x83"
+"\x1a\x8e\x25\xbc\x30\xc7\x5e\xf5"
+"\x69\x00\x97\x0b\xa2\x39\xd0\x44"
+"\xdb\x72\x09\x7d\x14\xab\x1f\xb6"
+"\x4d\xe4\x58\xef\x86\x1d\x91\x28"
+"\xbf\x33\xca\x61\xf8\x6c\x03\x9a"
+"\x0e\xa5\x3c\xd3\x47\xde\x75\x0c"
+"\x80\x17\xae\x22\xb9\x50\xe7\x5b"
+"\xf2\x89\x20\x94\x2b\xc2\x36\xcd"
+"\x64\xfb\x6f\x06\x9d\x11\xa8\x3f"
+"\xd6\x4a\xe1\x78\x0f\x83\x1a\xb1"
+"\x25\xbc\x53\xea\x5e\xf5\x8c\x00"
+"\x97\x2e\xc5\x39\xd0\x67\xfe\x72"
+"\x09\xa0\x14\xab\x42\xd9\x4d\xe4"
+"\x7b\x12\x86\x1d\xb4\x28\xbf\x56"
+"\xed\x61\xf8\x8f\x03\x9a\x31\xc8"
+"\x3c\xd3\x6a\x01\x75\x0c\xa3\x17"
+"\xae\x45\xdc\x50\xe7\x7e\x15\x89"
+"\x20\xb7\x2b\xc2\x59\xf0\x64\xfb"
+"\x92\x06\x9d\x34\xcb\x3f\xd6\x6d"
+"\x04\x78\x0f\xa6\x1a\xb1\x48\xdf"
+"\x53\xea\x81\x18\x8c\x23\xba\x2e"
+"\xc5\x5c\xf3\x67\xfe\x95\x09\xa0"
+"\x37\xce\x42\xd9\x70\x07\x7b\x12"
+"\xa9\x1d\xb4\x4b\xe2\x56\xed\x84"
+"\x1b\x8f\x26\xbd\x31\xc8\x5f\xf6"
+"\x6a\x01\x98\x0c\xa3\x3a\xd1\x45"
+"\xdc\x73\x0a\x7e\x15\xac\x20\xb7"
+"\x4e\xe5\x59\xf0\x87\x1e\x92\x29"
+"\xc0\x34\xcb\x62\xf9\x6d\x04\x9b"
+"\x0f\xa6\x3d\xd4\x48\xdf\x76\x0d"
+"\x81\x18\xaf\x23\xba\x51\xe8\x5c"
+"\xf3\x8a\x21\x95\x2c\xc3\x37\xce"
+"\x65\xfc\x70\x07\x9e\x12\xa9\x40"
+"\xd7\x4b\xe2\x79\x10\x84\x1b\xb2"
+"\x26\xbd\x54\xeb\x5f\xf6\x8d\x01"
+"\x98\x2f\xc6\x3a\xd1\x68\xff\x73"
+"\x0a\xa1\x15\xac\x43\xda\x4e\xe5"
+"\x7c\x13\x87\x1e\xb5\x29\xc0\x57"
+"\xee\x62\xf9\x90\x04\x9b\x32\xc9"
+"\x3d\xd4\x6b\x02\x76\x0d\xa4\x18"
+"\xaf\x46\xdd\x51\xe8\x7f\x16\x8a"
+"\x21\xb8\x2c\xc3\x5a\xf1\x65\xfc"
+"\x93\x07\x9e\x35\xcc\x40\xd7\x6e"
+"\x05\x79\x10\xa7\

[PATCH v2 0/3] sha3 fixes and new implementation for arm64

2018-01-14 Thread Ard Biesheuvel
Add an implementation of SHA3 to arm64 using the new special instructions,
and another one using scalar instructions but coded in assembler (#2)

In preparation of that, fix a bug in the SHA3 (#1) and add some new test
vectors to get better test coverage (#3).

v2: Drop generic SHA3 as a fallback for the arm64 module. Instead, provide
a special arm64 version to use as a fallback when the instructions are
not available or when executing in a context that does not allow SIMD

Drop patches that simplify the generic SHA3 and make it reusable by
other modules.

Ard Biesheuvel (3):
  crypto/generic: sha3 - fixes for alignment and big endian operation
  crypto/arm64: sha3 - new scalar + v8.2 Crypto Extensions
implementation
  crypto/testmgr: sha3 - add new testcases

 arch/arm64/crypto/Kconfig   |   4 +
 arch/arm64/crypto/Makefile  |   3 +
 arch/arm64/crypto/sha3-arm64-core.S | 512 ++
 arch/arm64/crypto/sha3-arm64-glue.c | 192 +++
 crypto/sha3_generic.c   |   5 +-
 crypto/testmgr.h| 550 
 6 files changed, 1264 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/crypto/sha3-arm64-core.S
 create mode 100644 arch/arm64/crypto/sha3-arm64-glue.c

-- 
2.11.0



[PATCH v2 2/3] crypto/arm64: sha3 - new scalar + v8.2 Crypto Extensions implementation

2018-01-14 Thread Ard Biesheuvel
Implement the various flavours of SHA3 using scalar instructions, and
using the new optional EOR3/RAX1/XAR/BCAX instructions introduced by
ARMv8.2.

Note that the scalar asm version is *much* faster than the C based
generic implementation: the SHA3 state matrix already occupies 25
registers, leaving very little to perform the computation, and the
compiler appears to give up and spill the state to memory.

  Performance comparison of SHA3-256 (cycles per byte)

                        generic      scalar arm64    speedup
  Cortex-A53 @ 1GHz     224.4 cpb    12.4 cpb        18.1x
  Cortex-A57 @ 2GHz     101.6 cpb    11.8 cpb         8.6x

The ARMv8.2 version has only been tested against emulators, so no
performance data is available yet.
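
For reference, the per-lane semantics of the four new instructions map onto
the KECCAK steps roughly as sketched below in C (my reading of the ARMv8.2
SHA3 extension, for illustration only; not taken from the patch):

#include <stdint.h>

static inline uint64_t rol64(uint64_t x, unsigned int n)  /* n in 1..63 */
{
	return (x << n) | (x >> (64 - n));
}

static inline uint64_t ror64(uint64_t x, unsigned int n)  /* n in 1..63 */
{
	return (x >> n) | (x << (64 - n));
}

/* EOR3: three-way XOR - folds the columns in the theta step */
static inline uint64_t keccak_eor3(uint64_t n, uint64_t m, uint64_t a)
{
	return n ^ m ^ a;
}

/* RAX1: XOR with a rotate-by-one - the theta 't' terms */
static inline uint64_t keccak_rax1(uint64_t n, uint64_t m)
{
	return n ^ rol64(m, 1);
}

/* XAR: XOR then rotate right by an immediate - the rho/pi rotations */
static inline uint64_t keccak_xar(uint64_t n, uint64_t m, unsigned int imm6)
{
	return ror64(n ^ m, imm6);
}

/* BCAX: bit clear and XOR - the chi step */
static inline uint64_t keccak_bcax(uint64_t n, uint64_t m, uint64_t a)
{
	return n ^ (m & ~a);
}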

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig   |   4 +
 arch/arm64/crypto/Makefile  |   3 +
 arch/arm64/crypto/sha3-arm64-core.S | 512 
 arch/arm64/crypto/sha3-arm64-glue.c | 192 
 4 files changed, 711 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index aad288f4b9de..71293e049a5d 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -35,6 +35,10 @@ config CRYPTO_SHA512_ARM64_CE
select CRYPTO_HASH
select CRYPTO_SHA512_ARM64
 
+config CRYPTO_SHA3_ARM64
+   tristate "SHA3 digest algorithm (scalar + ARMv8.2 Crypto Extensions)"
+   select CRYPTO_HASH
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index d7573d31d397..267764473ef6 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM64_CE) += sha512-ce.o
 sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-arm64.o
+sha3-arm64-y := sha3-arm64-glue.o sha3-arm64-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sha3-arm64-core.S 
b/arch/arm64/crypto/sha3-arm64-core.S
new file mode 100644
index ..e32f1e3e5b42
--- /dev/null
+++ b/arch/arm64/crypto/sha3-arm64-core.S
@@ -0,0 +1,512 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sha3-arm64-core.S - core SHA-3 transform using scalar or v8.2 Crypto
+ * Extensions instructions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+   /*
+* sha3_arm64_transform(u64 *st, const u8 *data, int blocks, int 
dg_size)
+*/
+   .align  4
+ENTRY(sha3_arm64_transform)
+   /* preserve callee save registers - no room for a frame pointer! */
+   stp x29, x30, [sp, #-144]!
+   stp x19, x20, [sp, #16]
+   stp x21, x22, [sp, #32]
+   stp x23, x24, [sp, #48]
+   stp x25, x26, [sp, #64]
+   stp x27, x28, [sp, #80]
+
+   stp  x0, x1, [sp, #96]  // preserve st, data
+   str  x3, [sp, #112] // preserve dg_size
+   mov x30, x2 // preserve #blocks
+
+   /* load state */
+   mov x25,  x0
+   ldp  x0,  x1, [x0]
+   ldp  x2,  x3, [x25, #16]
+   ldp  x4,  x5, [x25, #32]
+   ldp  x6,  x7, [x25, #48]
+   ldp  x8,  x9, [x25, #64]
+   ldp x10, x11, [x25, #80]
+   ldp x12, x13, [x25, #96]
+   ldp x14, x15, [x25, #112]
+   ldp x16, x17, [x25, #128]
+   ldp x18, x19, [x25, #144]
+   ldp x20, x21, [x25, #160]
+   ldp x22, x23, [x25, #176]
+   ldr x24, [x25, #192]
+
+0: adr_l   x29, .Lsha3_rcon + 72
+   stp x29, x30, [sp, #120]// preserve rc pointer, #blocks
+   ldp x29, x30, [sp, #104]// load data, dg_size
+
+   /* load input */
+   ldp x25, x26, [x29], #32
+   ldp x27, x28, [x29, #-16]
+CPU_BE(rev x25, x25)
+CPU_BE(rev x26, x26)
+CPU_BE(rev x27, x27)
+CPU_BE(rev x28, x28)
+   eor  x0,  x0, x25
+   eor  x1,  x1, x26
+   eor  x2,  x2, x27
+   eor  x3,  x3, x28
+
+   ldp x25, x26, [x29], #24
+   ldr x27, [x29, #-8]
+CPU_BE(

Re: [PATCH 0/5] sha3 fixes and new implementation for arm64

2018-01-12 Thread Ard Biesheuvel
On 12 January 2018 at 13:15, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> Add an implementation of SHA3 to arm64 using the new special instructions (#4)
>
> In preparation of that, fix a bug in the SHA3 and refactor it a bit so it
> can serve as a fallback for the other code. Also, add some new test vectors
> to get better test coverage.
>
> Ard Biesheuvel (5):
>   crypto/generic: sha3 - fixes for alignment and big endian operation
>   crypto/generic: sha3 - simplify code
>   crypto/generic: sha3 - export init/update/final routines
>   crypto/arm64: sha3 - new implementation based on special instructions

Forgot to mention: this is an RFT for patch #4, as it has not been
validated against a real implementation, only against my own QEMU
code.

>   crypto/testmgr: sha3 - add new testcases
>
>  arch/arm64/crypto/Kconfig|   6 +
>  arch/arm64/crypto/Makefile   |   3 +
>  arch/arm64/crypto/sha3-ce-core.S | 224 
>  arch/arm64/crypto/sha3-ce-glue.c | 156 ++
>  crypto/sha3_generic.c| 198 +++
>  crypto/testmgr.h | 550 
>  include/crypto/sha3.h|   6 +-
>  7 files changed, 1012 insertions(+), 131 deletions(-)
>  create mode 100644 arch/arm64/crypto/sha3-ce-core.S
>  create mode 100644 arch/arm64/crypto/sha3-ce-glue.c
>
> --
> 2.11.0
>


[PATCH 0/5] sha3 fixes and new implementation for arm64

2018-01-12 Thread Ard Biesheuvel
Add an implementation of SHA3 to arm64 using the new special instructions (#4)

In preparation of that, fix a bug in the SHA3 and refactor it a bit so it
can serve as a fallback for the other code. Also, add some new test vectors
to get better test coverage.

Ard Biesheuvel (5):
  crypto/generic: sha3 - fixes for alignment and big endian operation
  crypto/generic: sha3 - simplify code
  crypto/generic: sha3 - export init/update/final routines
  crypto/arm64: sha3 - new implementation based on special instructions
  crypto/testmgr: sha3 - add new testcases

 arch/arm64/crypto/Kconfig|   6 +
 arch/arm64/crypto/Makefile   |   3 +
 arch/arm64/crypto/sha3-ce-core.S | 224 
 arch/arm64/crypto/sha3-ce-glue.c | 156 ++
 crypto/sha3_generic.c| 198 +++
 crypto/testmgr.h | 550 
 include/crypto/sha3.h|   6 +-
 7 files changed, 1012 insertions(+), 131 deletions(-)
 create mode 100644 arch/arm64/crypto/sha3-ce-core.S
 create mode 100644 arch/arm64/crypto/sha3-ce-glue.c

-- 
2.11.0



[PATCH 3/5] crypto/generic: sha3 - export init/update/final routines

2018-01-12 Thread Ard Biesheuvel
To allow accelerated implementations to fall back to the generic
routines, e.g., in contexts where a SIMD based implementation is
not allowed to run, expose the generic SHA3 init/update/final
routines to other modules.
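
A rough sketch of the intended use (illustrative only; may_use_simd() and
kernel_neon_begin()/kernel_neon_end() are the usual arm64 helpers, the
sha3_ce_transform prototype comes from the companion arm64 patch, and
sha3_arch_update() is a made-up name): an accelerated update hook defers to
the exported generic routine whenever the NEON unit may not be used.

#include <linux/linkage.h>
#include <crypto/internal/hash.h>
#include <crypto/sha3.h>
#include <asm/neon.h>
#include <asm/simd.h>

asmlinkage void sha3_ce_transform(u64 *st, const u8 *data, int blocks,
				  int dg_size);

static int sha3_arch_update(struct shash_desc *desc, const u8 *data,
			    unsigned int len)
{
	if (!may_use_simd())
		return crypto_sha3_update(desc, data, len); /* generic fallback */

	kernel_neon_begin();
	/* block buffering and calls to sha3_ce_transform() would go here */
	kernel_neon_end();

	return 0;
}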

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 33 +++-
 include/crypto/sha3.h |  5 +++
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index 677247d429a1..86db5baafc83 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -87,7 +87,7 @@ static void keccakf(u64 st[25])
}
 }
 
-static int sha3_init(struct shash_desc *desc)
+int crypto_sha3_init(struct shash_desc *desc)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
@@ -99,8 +99,9 @@ static int sha3_init(struct shash_desc *desc)
memset(sctx->st, 0, sizeof(sctx->st));
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_init);
 
-static int sha3_update(struct shash_desc *desc, const u8 *data,
+int crypto_sha3_update(struct shash_desc *desc, const u8 *data,
   unsigned int len)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
@@ -136,8 +137,9 @@ static int sha3_update(struct shash_desc *desc, const u8 
*data,
 
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_update);
 
-static int sha3_final(struct shash_desc *desc, u8 *out)
+int crypto_sha3_final(struct shash_desc *desc, u8 *out)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int i, inlen = sctx->partial;
@@ -162,12 +164,13 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
memset(sctx, 0, sizeof(*sctx));
return 0;
 }
+EXPORT_SYMBOL(crypto_sha3_final);
 
 static struct shash_alg algs[] = { {
.digestsize = SHA3_224_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-224",
.base.cra_driver_name   = "sha3-224-generic",
@@ -176,9 +179,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_256_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-256",
.base.cra_driver_name   = "sha3-256-generic",
@@ -187,9 +190,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_384_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-384",
.base.cra_driver_name   = "sha3-384-generic",
@@ -198,9 +201,9 @@ static struct shash_alg algs[] = { {
.base.cra_module= THIS_MODULE,
 }, {
.digestsize = SHA3_512_DIGEST_SIZE,
-   .init   = sha3_init,
-   .update = sha3_update,
-   .final  = sha3_final,
+   .init   = crypto_sha3_init,
+   .update = crypto_sha3_update,
+   .final  = crypto_sha3_final,
.descsize   = sizeof(struct sha3_state),
.base.cra_name  = "sha3-512",
.base.cra_driver_name   = "sha3-512-generic",
diff --git a/include/crypto/sha3.h b/include/crypto/sha3.h
index 1339dcdbc9b2..080f60c2e6b1 100644
--- a/include/crypto/sha3.h
+++ b/include/crypto/sha3.h
@@ -26,4 +26,9 @@ struct sha3_state {
u8  buf[SHA3_224_BLOCK_SIZE];
 };
 
+int crypto_sha3_init(struct shash_desc *desc);
+int crypto_sha3_update(struct shash_desc *desc, const u8 *data,
+  unsigned int len);
+int crypto_sha3_final(struct shash_desc *desc, u8 *out);
+
 #endif
-- 
2.11.0



[PATCH 1/5] crypto/generic: sha3 - fixes for alignment and big endian operation

2018-01-12 Thread Ard Biesheuvel
Ensure that the input is byte swabbed before injecting it into the
SHA3 transform. Use the get_unaligned() accessor for this so that
we don't perform unaligned access inadvertently on architectures
that do not support that.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index 7e8ed96236ce..a68be626017c 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -18,6 +18,7 @@
 #include <linux/types.h>
 #include <crypto/sha3.h>
 #include <asm/byteorder.h>
+#include <asm/unaligned.h>
 
 #define KECCAK_ROUNDS 24
 
@@ -149,7 +150,7 @@ static int sha3_update(struct shash_desc *desc, const u8 
*data,
unsigned int i;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) src)[i];
+   sctx->st[i] ^= get_unaligned_le64(src + 8 * i);
keccakf(sctx->st);
 
done += sctx->rsiz;
@@ -174,7 +175,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
sctx->buf[sctx->rsiz - 1] |= 0x80;
 
for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] ^= ((u64 *) sctx->buf)[i];
+   sctx->st[i] ^= get_unaligned_le64(sctx->buf + 8 * i);
 
keccakf(sctx->st);
 
-- 
2.11.0



[PATCH 2/5] crypto/generic: sha3 - simplify code

2018-01-12 Thread Ard Biesheuvel
In preparation of exposing the generic SHA3 implementation to other
versions as a fallback, simplify the code, and remove an inconsistency
in the output handling (endian swabbing rsizw words of state before
writing the output does not make sense)

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/sha3_generic.c | 184 +++-
 include/crypto/sha3.h |   1 -
 2 files changed, 59 insertions(+), 126 deletions(-)

diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
index a68be626017c..677247d429a1 100644
--- a/crypto/sha3_generic.c
+++ b/crypto/sha3_generic.c
@@ -17,7 +17,6 @@
 #include <linux/module.h>
 #include <linux/types.h>
 #include <crypto/sha3.h>
-#include <asm/byteorder.h>
 #include <asm/unaligned.h>
 
 #define KECCAK_ROUNDS 24
@@ -88,43 +87,16 @@ static void keccakf(u64 st[25])
}
 }
 
-static void sha3_init(struct sha3_state *sctx, unsigned int digest_sz)
-{
-   memset(sctx, 0, sizeof(*sctx));
-   sctx->md_len = digest_sz;
-   sctx->rsiz = 200 - 2 * digest_sz;
-   sctx->rsizw = sctx->rsiz / 8;
-}
-
-static int sha3_224_init(struct shash_desc *desc)
+static int sha3_init(struct shash_desc *desc)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
 
-   sha3_init(sctx, SHA3_224_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_256_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
-
-   sha3_init(sctx, SHA3_256_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_384_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
-
-   sha3_init(sctx, SHA3_384_DIGEST_SIZE);
-   return 0;
-}
-
-static int sha3_512_init(struct shash_desc *desc)
-{
-   struct sha3_state *sctx = shash_desc_ctx(desc);
+   sctx->rsiz = 200 - 2 * digest_size;
+   sctx->rsizw = sctx->rsiz / 8;
+   sctx->partial = 0;
 
-   sha3_init(sctx, SHA3_512_DIGEST_SIZE);
+   memset(sctx->st, 0, sizeof(sctx->st));
return 0;
 }
 
@@ -169,6 +141,8 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
 {
struct sha3_state *sctx = shash_desc_ctx(desc);
unsigned int i, inlen = sctx->partial;
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
+   __le64 *digest = (__le64 *)out;
 
sctx->buf[inlen++] = 0x06;
memset(sctx->buf + inlen, 0, sctx->rsiz - inlen);
@@ -179,110 +153,70 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
 
keccakf(sctx->st);
 
-   for (i = 0; i < sctx->rsizw; i++)
-   sctx->st[i] = cpu_to_le64(sctx->st[i]);
+   for (i = 0; i < digest_size / 8; i++)
+   put_unaligned_le64(sctx->st[i], digest++);
 
-   memcpy(out, sctx->st, sctx->md_len);
+   if (digest_size & 4)
+   put_unaligned_le32(sctx->st[i], (__le32 *)digest);
 
memset(sctx, 0, sizeof(*sctx));
return 0;
 }
 
-static struct shash_alg sha3_224 = {
-   .digestsize =   SHA3_224_DIGEST_SIZE,
-   .init   =   sha3_224_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-224",
-   .cra_driver_name =  "sha3-224-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_224_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-};
-
-static struct shash_alg sha3_256 = {
-   .digestsize =   SHA3_256_DIGEST_SIZE,
-   .init   =   sha3_256_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-256",
-   .cra_driver_name =  "sha3-256-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_256_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-};
-
-static struct shash_alg sha3_384 = {
-   .digestsize =   SHA3_384_DIGEST_SIZE,
-   .init   =   sha3_384_init,
-   .update =   sha3_update,
-   .final  =   sha3_final,
-   .descsize   =   sizeof(struct sha3_state),
-   .base   =   {
-   .cra_name   =   "sha3-384",
-   .cra_driver_name =  "sha3-384-generic",
-   .cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
-   .cra_blocksize  =   SHA3_384_BLOCK_SIZE,
-   .cra_module =   THIS_MODULE,
-   }
-};
-
-static struct shash_alg sha3_512 = {
-   .dig

[PATCH 4/5] crypto/arm64: sha3 - new implementation based on special instructions

2018-01-12 Thread Ard Biesheuvel
Implement the various flavours of SHA3 using the new optional
EOR3/RAX1/XAR/BCAX instructions introduced by ARMv8.2.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig|   6 +
 arch/arm64/crypto/Makefile   |   3 +
 arch/arm64/crypto/sha3-ce-core.S | 224 
 arch/arm64/crypto/sha3-ce-glue.c | 156 ++
 4 files changed, 389 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index aad288f4b9de..4f2974687606 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -35,6 +35,12 @@ config CRYPTO_SHA512_ARM64_CE
select CRYPTO_HASH
select CRYPTO_SHA512_ARM64
 
+config CRYPTO_SHA3_ARM64_CE
+   tristate "SHA3 digest algorithm (ARMv8 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_HASH
+   select CRYPTO_SHA3
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index d7573d31d397..04eaf8b78816 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM64_CE) += sha512-ce.o
 sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SHA3_ARM64_CE) += sha3-ce.o
+sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
new file mode 100644
index ..b0b3d68ef3d3
--- /dev/null
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -0,0 +1,224 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sha3-ce-core.S - core SHA3 transform using v8.2 Crypto Extensions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+   .text
+
+   .irp
b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+   .set.Lv\b\().2d, \b
+   .set.Lv\b\().16b, \b
+   .endr
+
+   .macro  eor3, rd, rn, ra, rm
+   .inst   0xce000000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) |
(.L\rm << 16)
+   .endm
+
+   .macro  rax1, rd, rn, rm
+   .inst   0xce608c00 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  bcax, rd, rn, ra, rm
+   .inst   0xce200000 | .L\rd | (.L\rn << 5) | (.L\ra << 10) |
(.L\rm << 16)
+   .endm
+
+   .macro  xar, rd, rn, rm, imm6
+   .inst   0xce800000 | .L\rd | (.L\rn << 5) | ((\imm6) << 10) |
(.L\rm << 16)
+   .endm
+
+   /*
+* sha3_ce_transform(u64 *st, const u8 *data, int blocks, int dg_size);
+*/
+ENTRY(sha3_ce_transform)
+   /* load state */
+   mov x8, x0
+   ld1 { v0.1d- v3.1d}, [x8], #32
+   ld1 { v4.1d- v7.1d}, [x8], #32
+   ld1 { v8.1d-v11.1d}, [x8], #32
+   ld1 {v12.1d-v15.1d}, [x8], #32
+   ld1 {v16.1d-v19.1d}, [x8], #32
+   ld1 {v20.1d-v23.1d}, [x8], #32
+   ld1 {v24.1d}, [x8]
+
+0: sub w2, w2, #1
+   mov w8, #24
+   adr_l   x9, .Lsha3_rcon
+
+   /* load input */
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v29.8b-v31.8b}, [x1], #24
+   eor v0.8b, v0.8b, v25.8b
+   eor v1.8b, v1.8b, v26.8b
+   eor v2.8b, v2.8b, v27.8b
+   eor v3.8b, v3.8b, v28.8b
+   eor v4.8b, v4.8b, v29.8b
+   eor v5.8b, v5.8b, v30.8b
+   eor v6.8b, v6.8b, v31.8b
+
+   tbnzx3, #6, 2f  // SHA3-512
+
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   ld1 {v29.8b-v30.8b}, [x1], #16
+   eor  v7.8b,  v7.8b, v25.8b
+   eor  v8.8b,  v8.8b, v26.8b
+   eor  v9.8b,  v9.8b, v27.8b
+   eor v10.8b, v10.8b, v28.8b
+   eor v11.8b, v11.8b, v29.8b
+   eor v12.8b, v12.8b, v30.8b
+
+   tbnzx3, #4, 1f  // SHA3-384 or SHA3-224
+
+   // SHA3-256
+   ld1 {v25.8b-v28.8b}, [x1], #32
+   eor v13.8b, v13.8b, v25.8b
+   eor v14.8b, v14.8b, v26.8b
+   eor v15.8b, v15.8b, v27.8b
+   eor v16.8b, v16.8b, v28.8b
+   b   3f
+
+1: t

[PATCH 5/5] crypto/testmgr: sha3 - add new testcases

2018-01-12 Thread Ard Biesheuvel
All current SHA3 test cases are smaller than the SHA3 block size, which
means not all code paths are being exercised. So add a new test case to
each variant, and make one of the existing test cases chunked.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 crypto/testmgr.h | 550 
 1 file changed, 550 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..6044f6906bd6 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1052,6 +1052,142 @@ static const struct hash_testvec sha3_224_tv_template[] 
= {
"\xc9\xfd\x55\x74\x49\x44\x79\xba"
"\x5c\x7e\x7a\xb7\x6e\xf2\x64\xea"
"\xd0\xfc\xce\x33",
+   .np = 2,
+   .tap= { 28, 28 },
+   }, {
+   .plaintext = "\x08\x9f\x13\xaa\x41\xd8\x4c\xe3"
+"\x7a\x11\x85\x1c\xb3\x27\xbe\x55"
+"\xec\x60\xf7\x8e\x02\x99\x30\xc7"
+"\x3b\xd2\x69\x00\x74\x0b\xa2\x16"
+"\xad\x44\xdb\x4f\xe6\x7d\x14\x88"
+"\x1f\xb6\x2a\xc1\x58\xef\x63\xfa"
+"\x91\x05\x9c\x33\xca\x3e\xd5\x6c"
+"\x03\x77\x0e\xa5\x19\xb0\x47\xde"
+"\x52\xe9\x80\x17\x8b\x22\xb9\x2d"
+"\xc4\x5b\xf2\x66\xfd\x94\x08\x9f"
+"\x36\xcd\x41\xd8\x6f\x06\x7a\x11"
+"\xa8\x1c\xb3\x4a\xe1\x55\xec\x83"
+"\x1a\x8e\x25\xbc\x30\xc7\x5e\xf5"
+"\x69\x00\x97\x0b\xa2\x39\xd0\x44"
+"\xdb\x72\x09\x7d\x14\xab\x1f\xb6"
+"\x4d\xe4\x58\xef\x86\x1d\x91\x28"
+"\xbf\x33\xca\x61\xf8\x6c\x03\x9a"
+"\x0e\xa5\x3c\xd3\x47\xde\x75\x0c"
+"\x80\x17\xae\x22\xb9\x50\xe7\x5b"
+"\xf2\x89\x20\x94\x2b\xc2\x36\xcd"
+"\x64\xfb\x6f\x06\x9d\x11\xa8\x3f"
+"\xd6\x4a\xe1\x78\x0f\x83\x1a\xb1"
+"\x25\xbc\x53\xea\x5e\xf5\x8c\x00"
+"\x97\x2e\xc5\x39\xd0\x67\xfe\x72"
+"\x09\xa0\x14\xab\x42\xd9\x4d\xe4"
+"\x7b\x12\x86\x1d\xb4\x28\xbf\x56"
+"\xed\x61\xf8\x8f\x03\x9a\x31\xc8"
+"\x3c\xd3\x6a\x01\x75\x0c\xa3\x17"
+"\xae\x45\xdc\x50\xe7\x7e\x15\x89"
+"\x20\xb7\x2b\xc2\x59\xf0\x64\xfb"
+"\x92\x06\x9d\x34\xcb\x3f\xd6\x6d"
+"\x04\x78\x0f\xa6\x1a\xb1\x48\xdf"
+"\x53\xea\x81\x18\x8c\x23\xba\x2e"
+"\xc5\x5c\xf3\x67\xfe\x95\x09\xa0"
+"\x37\xce\x42\xd9\x70\x07\x7b\x12"
+"\xa9\x1d\xb4\x4b\xe2\x56\xed\x84"
+"\x1b\x8f\x26\xbd\x31\xc8\x5f\xf6"
+"\x6a\x01\x98\x0c\xa3\x3a\xd1\x45"
+"\xdc\x73\x0a\x7e\x15\xac\x20\xb7"
+"\x4e\xe5\x59\xf0\x87\x1e\x92\x29"
+"\xc0\x34\xcb\x62\xf9\x6d\x04\x9b"
+"\x0f\xa6\x3d\xd4\x48\xdf\x76\x0d"
+"\x81\x18\xaf\x23\xba\x51\xe8\x5c"
+"\xf3\x8a\x21\x95\x2c\xc3\x37\xce"
+"\x65\xfc\x70\x07\x9e\x12\xa9\x40"
+"\xd7\x4b\xe2\x79\x10\x84\x1b\xb2"
+"\x26\xbd\x54\xeb\x5f\xf6\x8d\x01"
+"\x98\x2f\xc6\x3a\xd1\x68\xff\x73"
+"\x0a\xa1\x15\xac\x43\xda\x4e\xe5"
+"\x7c\x13\x87\x1e\xb5\x29\xc0\x57"
+"\xee\x62\xf9\x90\x04\x9b\x32\xc9"
+"\x3d\xd4\x6b\x02\x76\x0d\xa4\x18"
+"\xaf\x46\xdd\x51\xe8\x7f\x16\x8a"
+"\x21\xb8\x2c\xc3\x5a\xf1\x65\xfc"
+"\x93\x07\x9e\x35\xcc\x40\xd7\x6e"
+"\x05\x79\x10\xa7\

[RFC PATCH] arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419

2018-01-10 Thread Ard Biesheuvel
Working around Cortex-A53 erratum #843419 involves special handling of
ADRP instructions that end up in the last two instruction slots of a
4k page, or whose output register gets overwritten without having been
read.

Normally, this gets taken care of by the linker, which can spot such
sequences at final link time, and insert a veneer if the ADRP ends up
at a vulnerable offset. However, linux kernel modules are partially
linked binaries, and so there is no 'final link time' other than the
runtime loading of the module, at which time all the static relocations
are resolved.

For this reason, we have implemented the #843419 workaround for modules
by avoiding ADRP instructions altogether, by using the large C model,
and by passing -mpc-relative-literal-loads to recent versions of GCC
that may emit adrp/ldr pairs to perform literal loads. However, this
workaround forces us to keep literal data mixed with the instructions
in the executable .text segment, and literal data may inadvertently
turn into an exploitable speculative gadget depending on the relative
offsets of arbitrary symbols.

So let's reimplement this workaround in a way that allows us to switch
back to the small C model, and to drop the -mpc-relative-literal-loads
GCC switch, by patching affected ADRP instructions at runtime:
- ADRP instructions that do not appear at 4k relative offset 0xff8 or
  0xffc are ignored
- ADRP instructions that are within 1 MB of their target symbol are
  converted into ADR instructions
- remaining ADRP instructions are redirected via a veneer that performs
  the load using an unaffected movn/movk sequence.
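
Schematically, the classification above amounts to the following sketch
(the enum and function names are made up for illustration; this is not
the code in the patch):

	#include <linux/sizes.h>
	#include <linux/types.h>

	enum adrp_fixup { ADRP_LEAVE, ADRP_TO_ADR, ADRP_VIA_VENEER };

	/* sketch only: decide how to handle one ADRP instruction at 'pc' */
	static enum adrp_fixup classify_adrp(u64 pc, u64 target)
	{
		s64 disp = (s64)(target - pc);

		/* only ADRPs in the last two slots of a 4k page are affected */
		if ((pc & 0xfff) != 0xff8 && (pc & 0xfff) != 0xffc)
			return ADRP_LEAVE;

		/* within +/- 1 MB, an ADR can cover the displacement itself */
		if (disp >= -(s64)SZ_1M && disp < (s64)SZ_1M)
			return ADRP_TO_ADR;

		/* otherwise, redirect via a movn/movk veneer */
		return ADRP_VIA_VENEER;
	}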

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/Kconfig  |  4 +-
 arch/arm64/Makefile |  1 -
 arch/arm64/include/asm/module.h |  2 +
 arch/arm64/kernel/module-plts.c | 62 
 arch/arm64/kernel/module.c  | 32 +-
 arch/arm64/kernel/reloc_test_core.c |  4 +-
 arch/arm64/kernel/reloc_test_syms.S | 12 +++-
 7 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e1414f..fa25de22b4fa 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -452,7 +452,7 @@ config ARM64_ERRATUM_845719
 config ARM64_ERRATUM_843419
bool "Cortex-A53: 843419: A load or store might access an incorrect 
address"
default y
-   select ARM64_MODULE_CMODEL_LARGE if MODULES
+   select ARM64_MODULE_PLTS if MODULES
help
  This option links the kernel with '--fix-cortex-a53-843419' and
  builds modules using the large memory model in order to avoid the use
@@ -1039,7 +1039,6 @@ config ARM64_MODULE_CMODEL_LARGE
 
 config ARM64_MODULE_PLTS
bool
-   select ARM64_MODULE_CMODEL_LARGE
select HAVE_MOD_ARCH_SPECIFIC
 
 config RELOCATABLE
@@ -1056,6 +1055,7 @@ config RELOCATABLE
 config RANDOMIZE_BASE
bool "Randomize the address of the kernel image"
select ARM64_MODULE_PLTS if MODULES
+   select ARM64_MODULE_CMODEL_LARGE
select RELOCATABLE
help
  Randomizes the virtual address at which the kernel image is
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index bd7cb205e28a..f49aa51fce05 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -27,7 +27,6 @@ ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
 $(warning ld does not support --fix-cortex-a53-843419; kernel may be 
susceptible to erratum)
   else
 LDFLAGS_vmlinux+= --fix-cortex-a53-843419
-KBUILD_CFLAGS_MODULE   += $(call cc-option, -mpc-relative-literal-loads)
   endif
 endif
 
diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
index 4f766178fa6f..b6dbbe3123a9 100644
--- a/arch/arm64/include/asm/module.h
+++ b/arch/arm64/include/asm/module.h
@@ -39,6 +39,8 @@ struct mod_arch_specific {
 u64 module_emit_plt_entry(struct module *mod, void *loc, const Elf64_Rela 
*rela,
  Elf64_Sym *sym);
 
+u64 module_emit_adrp_veneer(struct module *mod, void *loc, u64 val);
+
 #ifdef CONFIG_RANDOMIZE_BASE
 extern u64 module_alloc_base;
 #else
diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index ea640f92fe5a..b4e7fe45d337 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -41,6 +41,47 @@ u64 module_emit_plt_entry(struct module *mod, void *loc, 
const Elf64_Rela *rela,
return (u64)&plt[i];
 }
 
+#ifdef CONFIG_ARM64_ERRATUM_843419
+u64 module_emit_adrp_veneer(struct module *mod, void *loc, u64 val)
+{
+   struct mod_plt_sec *pltsec = !in_init(mod, loc) ? &mod->arch.core :
+ &mod->arch.init;
+   struct plt_entry *plt = (struct plt_entry *)pltsec->plt->sh_addr;
+   int i = pltsec->plt_num_entries;
+   u32 mov0, mov1, mov2, br;
+   int rd;
+
+   /* get the destination register of the ADRP instruction */

[PATCH 6/7] arm64/crypto: sha2-ce: move the round constant table to .rodata section

2018-01-10 Thread Ard Biesheuvel
Move the SHA2 round constant table to the .rodata section where it is
safe from being exploited by speculative execution.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha2-ce-core.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..4c3c89b812ce 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -53,6 +53,7 @@
/*
 * The SHA-256 round constants
 */
+   .section".rodata", "a"
.align  4
 .Lsha2_rcon:
.word   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
@@ -76,9 +77,10 @@
 * void sha2_ce_transform(struct sha256_ce_state *sst, u8 const *src,
 *int blocks)
 */
+   .text
 ENTRY(sha2_ce_transform)
/* load round constants */
-   adr x8, .Lsha2_rcon
+   adr_l   x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
-- 
2.11.0



[PATCH 3/7] arm64/crypto: aes-neon: move literal data to .rodata section

2018-01-10 Thread Ard Biesheuvel
Move the S-boxes and some other literals to the .rodata section where
it is safe from being exploited by speculative execution.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-neon.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index f1e3aa2732f9..1c7b45b7268e 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -32,10 +32,10 @@
 
/* preload the entire Sbox */
.macro  prepare, sbox, shiftrows, temp
-   adr \temp, \sbox
moviv12.16b, #0x1b
-   ldr q13, \shiftrows
-   ldr q14, .Lror32by8
+   ldr_l   q13, \shiftrows, \temp
+   ldr_l   q14, .Lror32by8, \temp
+   adr_l   \temp, \sbox
ld1 {v16.16b-v19.16b}, [\temp], #64
ld1 {v20.16b-v23.16b}, [\temp], #64
ld1 {v24.16b-v27.16b}, [\temp], #64
@@ -272,7 +272,7 @@
 
 #include "aes-modes.S"
 
-   .text
+   .section".rodata", "a"
.align  6
 .LForward_Sbox:
.byte   0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5
-- 
2.11.0



[PATCH 2/7] arm64/crypto: aes-cipher: move S-box to .rodata section

2018-01-10 Thread Ard Biesheuvel
Move the AES inverse S-box to the .rodata section where it is safe from
abuse by speculation.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/aes-cipher-core.S | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/crypto/aes-cipher-core.S 
b/arch/arm64/crypto/aes-cipher-core.S
index 6d2445d603cc..3a44eada2347 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -125,6 +125,16 @@ CPU_BE(rev w7, w7  )
ret
.endm
 
+ENTRY(__aes_arm64_encrypt)
+   do_cryptfround, crypto_ft_tab, crypto_ft_tab + 1, 2
+ENDPROC(__aes_arm64_encrypt)
+
+   .align  5
+ENTRY(__aes_arm64_decrypt)
+   do_cryptiround, crypto_it_tab, __aes_arm64_inverse_sbox, 0
+ENDPROC(__aes_arm64_decrypt)
+
+   .section".rodata", "a"
.align  L1_CACHE_SHIFT
.type   __aes_arm64_inverse_sbox, %object
 __aes_arm64_inverse_sbox:
@@ -161,12 +171,3 @@ __aes_arm64_inverse_sbox:
.byte   0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26
.byte   0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d
.size   __aes_arm64_inverse_sbox, . - __aes_arm64_inverse_sbox
-
-ENTRY(__aes_arm64_encrypt)
-   do_cryptfround, crypto_ft_tab, crypto_ft_tab + 1, 2
-ENDPROC(__aes_arm64_encrypt)
-
-   .align  5
-ENTRY(__aes_arm64_decrypt)
-   do_cryptiround, crypto_it_tab, __aes_arm64_inverse_sbox, 0
-ENDPROC(__aes_arm64_decrypt)
-- 
2.11.0



[PATCH 7/7] arm64/crypto: sha1-ce: get rid of literal pool

2018-01-10 Thread Ard Biesheuvel
Load the four SHA-1 round constants using immediates rather than literal
pool entries, to avoid having executable data that may be exploitable
under speculation attacks.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/sha1-ce-core.S | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 8550408735a0..46049850727d 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -58,12 +58,11 @@
sha1su1 v\s0\().4s, v\s3\().4s
.endm
 
-   /*
-* The SHA1 round constants
-*/
-   .align  4
-.Lsha1_rcon:
-   .word   0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6
+   .macro  loadrc, k, val, tmp
+   movz\tmp, :abs_g0_nc:\val
+   movk\tmp, :abs_g1:\val
+   dup \k, \tmp
+   .endm
 
/*
 * void sha1_ce_transform(struct sha1_ce_state *sst, u8 const *src,
@@ -71,11 +70,10 @@
 */
 ENTRY(sha1_ce_transform)
/* load round constants */
-   adr x6, .Lsha1_rcon
-   ld1r{k0.4s}, [x6], #4
-   ld1r{k1.4s}, [x6], #4
-   ld1r{k2.4s}, [x6], #4
-   ld1r{k3.4s}, [x6]
+   loadrc  k0.4s, 0x5a827999, w6
+   loadrc  k1.4s, 0x6ed9eba1, w6
+   loadrc  k2.4s, 0x8f1bbcdc, w6
+   loadrc  k3.4s, 0xca62c1d6, w6
 
/* load state */
ld1 {dgav.4s}, [x0]
-- 
2.11.0



[PATCH 4/7] arm64/crypto: crc32: move literal data to .rodata section

2018-01-10 Thread Ard Biesheuvel
Move CRC32 literal data to the .rodata section where it is safe from
being exploited by speculative execution.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crc32-ce-core.S | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S 
b/arch/arm64/crypto/crc32-ce-core.S
index 18f5a8442276..16ed3c7ebd37 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -50,7 +50,7 @@
 #include 
 #include 
 
-   .text
+   .section".rodata", "a"
.align  6
.cpugeneric+crypto+crc
 
@@ -115,12 +115,13 @@
 * uint crc32_pmull_le(unsigned char const *buffer,
 * size_t len, uint crc32)
 */
+   .text
 ENTRY(crc32_pmull_le)
-   adr x3, .Lcrc32_constants
+   adr_l   x3, .Lcrc32_constants
b   0f
 
 ENTRY(crc32c_pmull_le)
-   adr x3, .Lcrc32c_constants
+   adr_l   x3, .Lcrc32c_constants
 
 0: bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
-- 
2.11.0



[PATCH 5/7] arm64/crypto: crct10dif: move literal data to .rodata section

2018-01-10 Thread Ard Biesheuvel
Move the CRC-T10DIF literal data to the .rodata section where it is
safe from being exploited by speculative execution.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/crct10dif-ce-core.S | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S 
b/arch/arm64/crypto/crct10dif-ce-core.S
index d5b5a8c038c8..f179c01bd55c 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -128,7 +128,7 @@ CPU_LE( ext v7.16b, v7.16b, v7.16b, #8  
)
// XOR the initial_crc value
eor v0.16b, v0.16b, v10.16b
 
-   ldr q10, rk3// xmm10 has rk3 and rk4
+   ldr_l   q10, rk3, x8// xmm10 has rk3 and rk4
// type of pmull instruction
// will determine which constant to use
 
@@ -184,13 +184,13 @@ CPU_LE(   ext v12.16b, v12.16b, v12.16b, #8   
)
// fold the 8 vector registers to 1 vector register with different
// constants
 
-   ldr q10, rk9
+   ldr_l   q10, rk9, x8
 
.macro  fold16, reg, rk
pmull   v8.1q, \reg\().1d, v10.1d
pmull2  \reg\().1q, \reg\().2d, v10.2d
.ifnb   \rk
-   ldr q10, \rk
+   ldr_l   q10, \rk, x8
.endif
eor v7.16b, v7.16b, v8.16b
eor v7.16b, v7.16b, \reg\().16b
@@ -251,7 +251,7 @@ CPU_LE( ext v1.16b, v1.16b, v1.16b, #8  
)
 
// get rid of the extra data that was loaded before
// load the shift constant
-   adr x4, tbl_shf_table + 16
+   adr_l   x4, tbl_shf_table + 16
sub x4, x4, arg3
ld1 {v0.16b}, [x4]
 
@@ -275,7 +275,7 @@ CPU_LE( ext v1.16b, v1.16b, v1.16b, #8  
)
 
 _128_done:
// compute crc of a 128-bit value
-   ldr q10, rk5// rk5 and rk6 in xmm10
+   ldr_l   q10, rk5, x8// rk5 and rk6 in xmm10
 
// 64b fold
ext v0.16b, vzr.16b, v7.16b, #8
@@ -291,7 +291,7 @@ _128_done:
 
// barrett reduction
 _barrett:
-   ldr q10, rk7
+   ldr_l   q10, rk7, x8
mov v0.d[0], v7.d[1]
 
pmull   v0.1q, v0.1d, v10.1d
@@ -321,7 +321,7 @@ CPU_LE( ext v7.16b, v7.16b, v7.16b, #8  
)
b.eq_128_done   // exactly 16 left
b.lt_less_than_16_left
 
-   ldr q10, rk1// rk1 and rk2 in xmm10
+   ldr_l   q10, rk1, x8// rk1 and rk2 in xmm10
 
// update the counter. subtract 32 instead of 16 to save one
// instruction from the loop
@@ -333,7 +333,7 @@ CPU_LE( ext v7.16b, v7.16b, v7.16b, #8  
)
 
 _less_than_16_left:
// shl r9, 4
-   adr x0, tbl_shf_table + 16
+   adr_l   x0, tbl_shf_table + 16
sub x0, x0, arg3
ld1 {v0.16b}, [x0]
moviv9.16b, #0x80
@@ -345,6 +345,7 @@ ENDPROC(crc_t10dif_pmull)
 // precomputed constants
 // these constants are precomputed from the poly:
 // 0x8bb7 (0x8bb7 scaled to 32 bits)
+   .section".rodata", "a"
.align  4
 // Q = 0x18BB7
 // rk1 = 2^(32*3) mod Q << 32
-- 
2.11.0



[PATCH 1/7] arm64: kernel: avoid executable literal pools

2018-01-10 Thread Ard Biesheuvel
Recent versions of GCC will emit literals into a separate .rodata section
rather than interspersed with the instruction stream. We disabled this
in commit 67dfa1751ce71 ("arm64: errata: Add -mpc-relative-literal-loads
to build flags"), because it uses adrp/add pairs to reference these
literals even when building with -mcmodel=large, which breaks module
loading when we have the mitigation for Cortex-A53 erratum #843419
enabled.

However, due to the recent discoveries regarding speculative execution,
we should avoid putting data into executable sections, to prevent
creating speculative gadgets inadvertently.

So set -mpc-relative-literal-loads only for modules, and only if the
A53 erratum is enabled.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index b481b4a7c011..bd7cb205e28a 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -26,7 +26,8 @@ ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
   ifeq ($(call ld-option, --fix-cortex-a53-843419),)
 $(warning ld does not support --fix-cortex-a53-843419; kernel may be 
susceptible to erratum)
   else
-LDFLAGS_vmlinux+= --fix-cortex-a53-843419
+LDFLAGS_vmlinux+= --fix-cortex-a53-843419
+KBUILD_CFLAGS_MODULE   += $(call cc-option, -mpc-relative-literal-loads)
   endif
 endif
 
@@ -51,7 +52,6 @@ endif
 
 KBUILD_CFLAGS  += -mgeneral-regs-only $(lseinstr) $(brokengasinst)
 KBUILD_CFLAGS  += -fno-asynchronous-unwind-tables
-KBUILD_CFLAGS  += $(call cc-option, -mpc-relative-literal-loads)
 KBUILD_AFLAGS  += $(lseinstr) $(brokengasinst)
 
 KBUILD_CFLAGS  += $(call cc-option,-mabi=lp64)
-- 
2.11.0



[PATCH 0/7] arm64: move literal data into .rodata section

2018-01-10 Thread Ard Biesheuvel
Prevent inadvertently creating speculative gadgets by moving literal data
into the .rodata section.

Patch #1 enables this for C code, by reverting a change that disables the
GCC feature implementing this. Note that this conflicts with the mitigation
of erratum #843419 for Cortex-A53.

Patches #2 - #7 update the crypto asm code to move sboxes and round constant
tables (which may or may not be hiding 'interesting' opcodes) from .text
to .rodata

Ard Biesheuvel (7):
  arm64: kernel: avoid executable literal pools
  arm64/crypto: aes-cipher: move S-box to .rodata section
  arm64/crypto: aes-neon: move literal data to .rodata section
  arm64/crypto: crc32: move literal data to .rodata section
  arm64/crypto: crct10dif: move literal data to .rodata section
  arm64/crypto: sha2-ce: move the round constant table to .rodata
section
  arm64/crypto: sha1-ce: get rid of literal pool

 arch/arm64/Makefile   |  4 ++--
 arch/arm64/crypto/aes-cipher-core.S   | 19 ++-
 arch/arm64/crypto/aes-neon.S  |  8 
 arch/arm64/crypto/crc32-ce-core.S |  7 ---
 arch/arm64/crypto/crct10dif-ce-core.S | 17 +
 arch/arm64/crypto/sha1-ce-core.S  | 20 +---
 arch/arm64/crypto/sha2-ce-core.S  |  4 +++-
 7 files changed, 41 insertions(+), 38 deletions(-)

-- 
2.11.0



[RFT PATCH] crypto: arm64 - implement SHA-512 using special instructions

2018-01-09 Thread Ard Biesheuvel
Implement SHA-512 using the new special instructions that have
been introduced as an optional extension in ARMv8.2.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 arch/arm64/crypto/Kconfig  |   6 ++
 arch/arm64/crypto/Makefile |   3 +
 arch/arm64/crypto/sha512-ce-core.S | 207 +
 arch/arm64/crypto/sha512-ce-glue.c | 119 +
 4 files changed, 335 insertions(+)
 create mode 100644 arch/arm64/crypto/sha512-ce-core.S
 create mode 100644 arch/arm64/crypto/sha512-ce-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 70c517aa4501..aad288f4b9de 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -29,6 +29,12 @@ config CRYPTO_SHA2_ARM64_CE
select CRYPTO_HASH
select CRYPTO_SHA256_ARM64
 
+config CRYPTO_SHA512_ARM64_CE
+   tristate "SHA-384/SHA-512 digest algorithm (ARMv8 Crypto Extensions)"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_HASH
+   select CRYPTO_SHA512_ARM64
+
 config CRYPTO_GHASH_ARM64_CE
tristate "GHASH/AES-GCM using ARMv8 Crypto Extensions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index b5edc5918c28..d7573d31d397 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -14,6 +14,9 @@ sha1-ce-y := sha1-ce-glue.o sha1-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA2_ARM64_CE) += sha2-ce.o
 sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SHA512_ARM64_CE) += sha512-ce.o
+sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sha512-ce-core.S 
b/arch/arm64/crypto/sha512-ce-core.S
new file mode 100644
index ..6c562f8df0b0
--- /dev/null
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -0,0 +1,207 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sha512-ce-core.S - core SHA-384/SHA-512 transform using v8 Crypto Extensions
+ *
+ * Copyright (C) 2018 Linaro Ltd <ard.biesheu...@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+   //
+   // Temporary - for testing only. binutils has no support for these yet
+   //
+   .irp    b,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+   .set.Lq\b, \b
+   .set.Lv\b\().2d, \b
+   .endr
+
+   .macro  sha512h, rd, rn, rm
+   .inst   0xce608000 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sha512h2, rd, rn, rm
+   .inst   0xce608400 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .macro  sha512su0, rd, rn
+   .inst   0xcec08000 | .L\rd | (.L\rn << 5)
+   .endm
+
+   .macro  sha512su1, rd, rn, rm
+   .inst   0xce608800 | .L\rd | (.L\rn << 5) | (.L\rm << 16)
+   .endm
+
+   .text
+   .arch   armv8-a+crypto
+
+   /*
+* The SHA-512 round constants
+*/
+   .align  4
+.Lsha512_rcon:
+   .quad   0x428a2f98d728ae22, 0x7137449123ef65cd
+   .quad   0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc
+   .quad   0x3956c25bf348b538, 0x59f111f1b605d019
+   .quad   0x923f82a4af194f9b, 0xab1c5ed5da6d8118
+   .quad   0xd807aa98a3030242, 0x12835b0145706fbe
+   .quad   0x243185be4ee4b28c, 0x550c7dc3d5ffb4e2
+   .quad   0x72be5d74f27b896f, 0x80deb1fe3b1696b1
+   .quad   0x9bdc06a725c71235, 0xc19bf174cf692694
+   .quad   0xe49b69c19ef14ad2, 0xefbe4786384f25e3
+   .quad   0x0fc19dc68b8cd5b5, 0x240ca1cc77ac9c65
+   .quad   0x2de92c6f592b0275, 0x4a7484aa6ea6e483
+   .quad   0x5cb0a9dcbd41fbd4, 0x76f988da831153b5
+   .quad   0x983e5152ee66dfab, 0xa831c66d2db43210
+   .quad   0xb00327c898fb213f, 0xbf597fc7beef0ee4
+   .quad   0xc6e00bf33da88fc2, 0xd5a79147930aa725
+   .quad   0x06ca6351e003826f, 0x142929670a0e6e70
+   .quad   0x27b70a8546d22ffc, 0x2e1b21385c26c926
+   .quad   0x4d2c6dfc5ac42aed, 0x53380d139d95b3df
+   .quad   0x650a73548baf63de, 0x766a0abb3c77b2a8
+   .quad   0x81c2c92e47edaee6, 0x92722c851482353b
+   .quad   0xa2bfe8a14cf10364, 0xa81a664bbc423001
+   .quad   0xc24b8b70d0f89791, 0xc76c51a30654be30
+   .quad   0xd192e819d6ef5218, 0xd69906245565a910
+   .quad   0xf40e35855771202a, 0x106aa07032bbd1b8
+   .quad   0x19a4c116b8d2d0c8, 0x1e376c085141ab53
+ 

Re: Hang loading omap_rng on MacchiatoBin with 4.15-rc7

2018-01-09 Thread Ard Biesheuvel
On 9 January 2018 at 08:31, Riku Voipio  wrote:
> Hi,
>
> Loading omap_rng module on McBin causes hangup (in about 9/10 times).
> Looking at /proc/interrupts it seems the interrupt starts running like
> crazy, and after a while the whole system is unresponsive. This with
> Debian kernel (everything possible as modules) and EFI as bootloader.
> The EFI firmware appears[1] to use the rng unit to provide a seed for
> KASRL, I wonder if the driver needs to depend less on the state left
> by firmware, or the firmware needs to de-initialize the RNG before
> booting.
>
...
>  87:  0  0  0  0  ICU.f21e  95
> Level f276.trng
>  88:2532580  0  0  0  ICU.f41e  95
> Level f476.trng
...

My original code had

gMarvellTokenSpaceGuid.PcdEip76TrngBaseAddress|0xF276

which means the interrupt storm is being caused by the /other/ RNG,
not the one UEFI uses.

Could you please check whether your UEFI source is still using the
same base address?


Re: [PATCH] [v2] crypto: aes-generic - build with -Os on gcc-7+

2018-01-04 Thread Ard Biesheuvel
On 3 January 2018 at 22:39, Arnd Bergmann <a...@arndb.de> wrote:
> While testing other changes, I discovered that gcc-7.2.1 produces badly
> optimized code for aes_encrypt/aes_decrypt. This is especially true when
> CONFIG_UBSAN_SANITIZE_ALL is enabled, where it leads to extremely
> large stack usage that in turn might cause kernel stack overflows:
>
> crypto/aes_generic.c: In function 'aes_encrypt':
> crypto/aes_generic.c:1371:1: warning: the frame size of 4880 bytes is larger 
> than 2048 bytes [-Wframe-larger-than=]
> crypto/aes_generic.c: In function 'aes_decrypt':
> crypto/aes_generic.c:1441:1: warning: the frame size of 4864 bytes is larger 
> than 2048 bytes [-Wframe-larger-than=]
>
> I verified that this problem exists on all architectures that are
> supported by gcc-7.2, though arm64 in particular is less affected than
> the others. I also found that gcc-7.1 and gcc-8 do not show the extreme
> stack usage but still produce worse code than earlier versions for this
> file, apparently because of optimization passes that generally provide
> a substantial improvement in object code quality but understandably fail
> to find any shortcuts in the AES algorithm.
>
> Possible workarounds include
>
> a) disabling -ftree-pre and -ftree-sra optimizations, this was an earlier
>patch I tried, which reliably fixed the stack usage, but caused a
>serious performance regression in some versions, as later testing
>found.
>
> b) disabling UBSAN on this file or all ciphers, as suggested by Ard
>Biesheuvel. This would lead to massively better crypto performance in
>UBSAN-enabled kernels and avoid the stack usage, but there is a concern
>over whether we should exclude arbitrary files from UBSAN at all.
>
> c) Forcing the optimization level in a different way. Similar to a),
>but rather than deselecting specific optimization stages,
>this now uses "gcc -Os" for this file, regardless of the
>CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE/SIZE option. This is a reliable
>workaround for the stack consumption on all architecture, and I've
>retested the performance results now on x86, cycles/byte (lower is
>better) for cbc(aes-generic) with 256 bit keys:
>
>              -O2    -Os
> gcc-6.3.1   14.9   15.1
> gcc-7.0.1   14.7   15.3
> gcc-7.1.1   15.3   14.7
> gcc-7.2.1   16.8   15.9
> gcc-8.0.0   15.5   15.6
>
> This implements the option c) by enabling forcing -Os on all compiler
> versions starting with gcc-7.1. As a workaround for PR83356, it would
> only be needed for gcc-7.2+ with UBSAN enabled, but since it also shows
> better performance on gcc-7.1 without UBSAN, it seems appropriate to
> use the faster version here as well.
>
> Side note: during testing, I also played with the AES code in libressl,
> which had a similar performance regression from gcc-6 to gcc-7.2,
> but was three times slower overall. It might be interesting to
> investigate that further and possibly port the Linux implementation
> into that.
>
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
> Cc: Richard Biener <rguent...@suse.de>
> Cc: Jakub Jelinek <ja...@gcc.gnu.org>
> Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  crypto/Makefile | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/crypto/Makefile b/crypto/Makefile
> index d674884b2d51..daa69360e054 100644
> --- a/crypto/Makefile
> +++ b/crypto/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_CRYPTO_TWOFISH_COMMON) += twofish_common.o
>  obj-$(CONFIG_CRYPTO_SERPENT) += serpent_generic.o
>  CFLAGS_serpent_generic.o := $(call cc-option,-fsched-pressure)  # 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
>  obj-$(CONFIG_CRYPTO_AES) += aes_generic.o
> +CFLAGS_aes_generic.o := $(call cc-ifversion, -ge, 0701, -Os) # 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
>  obj-$(CONFIG_CRYPTO_AES_TI) += aes_ti.o
>  obj-$(CONFIG_CRYPTO_CAMELLIA) += camellia_generic.o
>  obj-$(CONFIG_CRYPTO_CAST_COMMON) += cast_common.o
> --
> 2.9.0
>


Re: [PATCH] [RFT] crypto: aes-generic - turn off -ftree-pre and -ftree-sra

2018-01-03 Thread Ard Biesheuvel
On 3 January 2018 at 16:37, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Dec 22, 2017 at 4:47 PM, Ard Biesheuvel
> <ard.biesheu...@linaro.org> wrote:
>> On 21 December 2017 at 13:47, PrasannaKumar Muralidharan 
>> <prasannatsmku...@gmail.com> wrote:
>>> On 21 December 2017 at 17:52, Ard Biesheuvel <ard.biesheu...@linaro.org> 
>>> wrote:
>>>> On 21 December 2017 at 10:20, Arnd Bergmann <a...@arndb.de> wrote:
>>>>
>>>> So my vote is to disable UBSAN for all such cipher implementations:
>>>> aes_generic, but also aes_ti, which has a similar 256 byte lookup
>>>> table [although it does not seem to be affected by the same issue as
>>>> aes_generic], and possibly others as well.
>>>>
>>>> Perhaps it makes sense to move core cipher code into a separate
>>>> sub-directory, and disable UBSAN at the directory level?
>>>>
>>>> It would involve the following files
>>>>
>>>> crypto/aes_generic.c
>>>> crypto/aes_ti.c
>>>> crypto/anubis.c
>>>> crypto/arc4.c
>>>> crypto/blowfish_generic.c
>>>> crypto/camellia_generic.c
>>>> crypto/cast5_generic.c
>>>> crypto/cast6_generic.c
>>>> crypto/des_generic.c
>>>> crypto/fcrypt.c
>>>> crypto/khazad.c
>>>> crypto/seed.c
>>>> crypto/serpent_generic.c
>>>> crypto/tea.c
>>>> crypto/twofish_generic.c
>>>
>>> As *SAN is enabled only on developer setup, is such a change required?
>>> Looks like I am missing something here. Can you explain what value it
>>> provides?
>>>
>>
>> Well, in this particular case, the value it provides is that the
>> kernel can still boot and invoke the AES code without overflowing the
>> kernel stack. Of course, this is a compiler issue that hopefully gets
>> fixed, but I think it may be reasonable to exclude some C code from
>> UBSAN by default.
>
> Any idea how to proceed here? I've retested with the latest gcc snapshot
> and verified that the problem is still there. No idea what the chance of
> getting it fixed before the 7.3 release is. From the performance tests
> I've done, the patch I posted is pretty much useless, it causes significant
> performance regressions on most other compiler versions.
>
> A minimal patch would be to disable UBSAN specifically for aes-generic.c
> for gcc-7.2+ but not gcc-8 to avoid the potential stack overflow. We could
> also force building with -Os on gcc-7, and leave UBSAN enabled,
> this would improve performance some 3-5% on x86 with gcc-7 (both
> 7.1 and 7.2.1) and avoid the stack overflow.
>

Can't we just disable UBSAN for that file for all GCC versions and be
done with it? It is not a production feature, and that code is
unlikely to change in ways where UBSAN would make a difference anyway,
nor is it ever executed on 99.9% of systems running Linux.
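
If I'm reading the kbuild UBSAN support correctly, excluding individual
objects is a two-line change in crypto/Makefile (sketch, assuming
CONFIG_UBSAN_SANITIZE_ALL is what pulls the instrumentation in):

	# opt the table-based C cipher implementations out of UBSAN
	UBSAN_SANITIZE_aes_generic.o := n
	UBSAN_SANITIZE_aes_ti.o := n

and if we do move the cipher code into its own sub-directory, a single
'UBSAN_SANITIZE := n' in that directory's Makefile should cover all of
the files listed above.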

> For the performance regression in gcc-7.2.1 on this file, I've opened
> a separate gcc PR now, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
> I've also tested the libressl version of their generic AES code, with
> mixed results (it's appears to be much slower than the kernel version
> to start with, and while it has further performance regressions with recent
> compilers, those are with a different set of versions compared to the
> kernel implementation, and it does not suffer from the high stack usage).
>


Re: [PATCH] [RFT] crypto: aes-generic - turn off -ftree-pre and -ftree-sra

2017-12-22 Thread Ard Biesheuvel
On 21 December 2017 at 13:47, PrasannaKumar Muralidharan
<prasannatsmku...@gmail.com> wrote:
> Hi Ard,
>
> On 21 December 2017 at 17:52, Ard Biesheuvel <ard.biesheu...@linaro.org> 
> wrote:
>> On 21 December 2017 at 10:20, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Wed, Dec 20, 2017 at 10:46 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>>>> On Wed, Dec 20, 2017 at 09:52:05PM +0100, Arnd Bergmann wrote:
>>>>> diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
>>>>> index ca554d57d01e..35f973ba9878 100644
>>>>> --- a/crypto/aes_generic.c
>>>>> +++ b/crypto/aes_generic.c
>>>>> @@ -1331,6 +1331,20 @@ EXPORT_SYMBOL_GPL(crypto_aes_set_key);
>>>>>   f_rl(bo, bi, 3, k); \
>>>>>  } while (0)
>>>>>
>>>>> +#if __GNUC__ >= 7
>>>>> +/*
>>>>> + * Newer compilers try to optimize integer arithmetic more aggressively,
>>>>> + * which generally improves code quality a lot, but in this specific case
>>>>> + * ends up hurting more than it helps, in some configurations drastically
>>>>> + * so. This turns off two optimization steps that have been shown to
>>>>> + * lead to rather badly optimized code with gcc-7.
>>>>> + *
>>>>> + * See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
>>>>> + */
>>>>> +#pragma GCC optimize("-fno-tree-pre")
>>>>> +#pragma GCC optimize("-fno-tree-sra")
>>>>
>>>> So do it only when UBSAN is enabled?  GCC doesn't have a particular
>>>> predefined macro for those (only for asan and tsan), but either the kernel
>>>> does have something already, or could have something added in the
>>>> corresponding Makefile.
>>>
>>> My original interpretation of the resulting object code suggested that 
>>> disabling
>>> those two optimizations produced better results for this particular
>>> file even without
>>> UBSAN, on both gcc-7 and gcc-8 (but not gcc-6), so my patch might have
>>> been better, but I did some measurements now as Ard suggested, showing
>>> cycles/byte for AES256/CBC with 8KB blocks:
>>>
>>>
>>>             default    ubsan   patched   patched+ubsan
>>> gcc-4.3.6     14.9                14.9
>>> gcc-4.6.4     15.0                15.8
>>> gcc-4.9.4     15.5      20.7      15.9        20.9
>>> gcc-5.5.0     15.6      47.3      86.4        48.8
>>> gcc-6.3.1     14.6      49.4      94.3        50.9
>>> gcc-7.1.1     13.5      54.6      15.2        52.0
>>> gcc-7.2.1     16.8     124.7      92.0        52.2
>>> gcc-8.0.0     15.0    no boot     15.3       no boot
>>>
>>> I checked that there are actually three significant digits on the 
>>> measurements,
>>> detailed output is available at https://pastebin.com/eFsWYjQp
>>>
>>> It seems that I was wrong about the interpretation that disabling
>>> the optimization would be a win on gcc-7 and higher, it almost
>>> always makes things worse even with UBSAN. Making that
>>> check "#if __GNUC__ == 7 && IS_ENABLED(CONFIG_UBSAN_SANITIZE_ALL)"
>>> would help here, or we could list the file as an exception for
>>> UBSAN and never sanitize it.
>>>
>>> Looking at the 'default' column, I wonder if anyone would be interested
>>> in looking at why the throughput regressed with gcc-7.2 and gcc-8.
>>>
>>
>> Thanks for the elaborate benchmarks. Looking at the bugzilla entry, it
>> appears the UBSAN code inserts runtime checks to ensure that certain
>> u8 variables don't assume values <0 or >255, which seems like a rather
>> pointless exercise to me. But even if it didn't, I think it is
>> justified to disable UBSAN on all of the low-level cipher
>> implementations, given that they are hardly ever modified, and
>> typically don't suffer from the issues UBSAN tries to identify.
>>
>> So my vote is to disable UBSAN for all such cipher implementations:
>> aes_generic, but also aes_ti, which has a similar 256 byte lookup
>> table [although it does not seem to be affected by the same issue as
>> aes_generic], and possibly others as well.
>>
>> Perhaps it makes sense to move core cipher code into a separate
>> sub-directory, and disable UBSAN at the directory level?
>>
>> It would involve the following files
>>
>> crypto/aes_generic.c
>> crypto/aes_ti.c
>> crypto/anubis.c
>> crypto/arc4.c
>> crypto/blowfish_generic.c
>> crypto/camellia_generic.c
>> crypto/cast5_generic.c
>> crypto/cast6_generic.c
>> crypto/des_generic.c
>> crypto/fcrypt.c
>> crypto/khazad.c
>> crypto/seed.c
>> crypto/serpent_generic.c
>> crypto/tea.c
>> crypto/twofish_generic.c
>
> As *SAN is enabled only on developer setup, is such a change required?
> Looks like I am missing something here. Can you explain what value it
> provides?
>

Well, in this particular case, the value it provides is that the
kernel can still boot and invoke the AES code without overflowing the
kernel stack. Of course, this is a compiler issue that hopefully gets
fixed, but I think it may be reasonable to exclude some C code from
UBSAN by default.


Re: [PATCH] [RFT] crypto: aes-generic - turn off -ftree-pre and -ftree-sra

2017-12-21 Thread Ard Biesheuvel
On 21 December 2017 at 10:20, Arnd Bergmann  wrote:
> On Wed, Dec 20, 2017 at 10:46 PM, Jakub Jelinek  wrote:
>> On Wed, Dec 20, 2017 at 09:52:05PM +0100, Arnd Bergmann wrote:
>>> diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
>>> index ca554d57d01e..35f973ba9878 100644
>>> --- a/crypto/aes_generic.c
>>> +++ b/crypto/aes_generic.c
>>> @@ -1331,6 +1331,20 @@ EXPORT_SYMBOL_GPL(crypto_aes_set_key);
>>>   f_rl(bo, bi, 3, k); \
>>>  } while (0)
>>>
>>> +#if __GNUC__ >= 7
>>> +/*
>>> + * Newer compilers try to optimize integer arithmetic more aggressively,
>>> + * which generally improves code quality a lot, but in this specific case
>>> + * ends up hurting more than it helps, in some configurations drastically
>>> + * so. This turns off two optimization steps that have been shown to
>>> + * lead to rather badly optimized code with gcc-7.
>>> + *
>>> + * See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
>>> + */
>>> +#pragma GCC optimize("-fno-tree-pre")
>>> +#pragma GCC optimize("-fno-tree-sra")
>>
>> So do it only when UBSAN is enabled?  GCC doesn't have a particular
>> predefined macro for those (only for asan and tsan), but either the kernel
>> does have something already, or could have something added in the
>> corresponding Makefile.
>
> My original interpretation of the resulting object code suggested that 
> disabling
> those two optimizations produced better results for this particular
> file even without
> UBSAN, on both gcc-7 and gcc-8 (but not gcc-6), so my patch might have
> been better, but I did some measurements now as Ard suggested, showing
> cycles/byte for AES256/CBC with 8KB blocks:
>
>
>             default    ubsan   patched   patched+ubsan
> gcc-4.3.6     14.9                14.9
> gcc-4.6.4     15.0                15.8
> gcc-4.9.4     15.5      20.7      15.9        20.9
> gcc-5.5.0     15.6      47.3      86.4        48.8
> gcc-6.3.1     14.6      49.4      94.3        50.9
> gcc-7.1.1     13.5      54.6      15.2        52.0
> gcc-7.2.1     16.8     124.7      92.0        52.2
> gcc-8.0.0     15.0    no boot     15.3       no boot
>
> I checked that there are actually three significant digits on the 
> measurements,
> detailed output is available at https://pastebin.com/eFsWYjQp
>
> It seems that I was wrong about the interpretation that disabling
> the optimization would be a win on gcc-7 and higher, it almost
> always makes things worse even with UBSAN. Making that
> check "#if __GNUC__ == 7 && IS_ENABLED(CONFIG_UBSAN_SANITIZE_ALL)"
> would help here, or we could list the file as an exception for
> UBSAN and never sanitize it.
>
> Looking at the 'default' column, I wonder if anyone would be interested
> in looking at why the throughput regressed with gcc-7.2 and gcc-8.
>

Thanks for the elaborate benchmarks. Looking at the bugzilla entry, it
appears the UBSAN code inserts runtime checks to ensure that certain
u8 variables don't assume values <0 or >255, which seems like a rather
pointless exercise to me. But even if it didn't, I think it is
justified to disable UBSAN on all of the low-level cipher
implementations, given that they are hardly ever modified, and
typically don't suffer from the issues UBSAN tries to identify.

So my vote is to disable UBSAN for all such cipher implementations:
aes_generic, but also aes_ti, which has a similar 256 byte lookup
table [although it does not seem to be affected by the same issue as
aes_generic], and possibly others as well.

Perhaps it makes sense to move core cipher code into a separate
sub-directory, and disable UBSAN at the directory level?

It would involve the following files

crypto/aes_generic.c
crypto/aes_ti.c
crypto/anubis.c
crypto/arc4.c
crypto/blowfish_generic.c
crypto/camellia_generic.c
crypto/cast5_generic.c
crypto/cast6_generic.c
crypto/des_generic.c
crypto/fcrypt.c
crypto/khazad.c
crypto/seed.c
crypto/serpent_generic.c
crypto/tea.c
crypto/twofish_generic.c


Re: [PATCH] [RFT] crypto: aes-generic - turn off -ftree-pre and -ftree-sra

2017-12-20 Thread Ard Biesheuvel
On 20 December 2017 at 20:52, Arnd Bergmann  wrote:
> While testing other changes, I discovered that gcc-7.2.1 produces badly
> optimized code for aes_encrypt/aes_decrypt. This is especially true when
> CONFIG_UBSAN_SANITIZE_ALL is enabled, where it leads to extremely
> large stack usage that in turn might cause kernel stack overflows:
>
> crypto/aes_generic.c: In function 'aes_encrypt':
> crypto/aes_generic.c:1371:1: warning: the frame size of 4880 bytes is larger 
> than 2048 bytes [-Wframe-larger-than=]
> crypto/aes_generic.c: In function 'aes_decrypt':
> crypto/aes_generic.c:1441:1: warning: the frame size of 4864 bytes is larger 
> than 2048 bytes [-Wframe-larger-than=]
>
> I verified that this problem exists on all architectures that are
> supported by gcc-7.2, though arm64 in particular is less affected than
> the others. I also found that gcc-7.1 and gcc-8 do not show the extreme
> stack usage but still produce worse code than earlier versions for this
> file, apparently because of optimization passes that generally provide
> a substantial improvement in object code quality but understandably fail
> to find any shortcuts in the AES algorithm.
>
> Turning off the tree-pre and tree-sra optimization steps seems to
> reverse the effect, and could be used as a workaround in case we
> don't get a good gcc fix.
>
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> Cc: Richard Biener 
> Cc: Jakub Jelinek 
> Signed-off-by: Arnd Bergmann 
> ---
> Jakub and Richard have done a more detailed analysis of this, and are
> working on ways to improve the code again. In the meantime, I'm sending
> this workaround to the Linux crypto maintainers to make them aware of
> this issue and for testing.
>
> What would be a good way to test the performance of the AES code with
> the various combinations of compiler versions, as well as UBSAN and this
> patch enabled or disabled?

You can use the tcrypt.ko module to benchmark AES.

modprobe tcrypt mode=200 sec=1

to run a (lengthy) AES benchmark in various modes. AES-128 in ECB mode
using the largest block size tested is what I usually use for
comparison.

On my Cortex-A57, the generic AES code runs at ~18 cycles per byte.
Note that we have alternative scalar implementations on ARM and arm64
that are faster, so the performance of aes-generic is not really
relevant (and so it is essentially dead code).


> ---
>  crypto/aes_generic.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
> index ca554d57d01e..35f973ba9878 100644
> --- a/crypto/aes_generic.c
> +++ b/crypto/aes_generic.c
> @@ -1331,6 +1331,20 @@ EXPORT_SYMBOL_GPL(crypto_aes_set_key);
> f_rl(bo, bi, 3, k); \
>  } while (0)
>
> +#if __GNUC__ >= 7
> +/*
> + * Newer compilers try to optimize integer arithmetic more aggressively,
> + * which generally improves code quality a lot, but in this specific case
> + * ends up hurting more than it helps, in some configurations drastically
> + * so. This turns off two optimization steps that have been shown to
> + * lead to rather badly optimized code with gcc-7.
> + *
> + * See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> + */
> +#pragma GCC optimize("-fno-tree-pre")
> +#pragma GCC optimize("-fno-tree-sra")
> +#endif
> +
>  static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
> const struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> --
> 2.9.0
>


Re: [RFC PATCH] crypto: chacha20 - add implementation using 96-bit nonce

2017-12-08 Thread Ard Biesheuvel
On 8 December 2017 at 23:11, Eric Biggers <ebigge...@gmail.com> wrote:
> On Fri, Dec 08, 2017 at 10:54:24PM +0000, Ard Biesheuvel wrote:
>> >> Note that there are two conflicting conventions for what inputs ChaCha 
>> >> takes.
>> >> The original paper by Daniel Bernstein
>> >> (https://cr.yp.to/chacha/chacha-20080128.pdf) says that the block counter 
>> >> is
>> >> 64-bit and the nonce is 64-bit, thereby expanding the key into 2^64 
>> >> randomly
>> >> accessible streams, each containing 2^64 randomly accessible 64-byte 
>> >> blocks.
>> >>
>> >> The RFC 7539 convention is equivalent to seeking to a large offset 
>> >> (determined
>> >> by the first 32 bits of the 96-bit nonce) in the keystream defined by the 
>> >> djb
>> >> convention, but only if the 32-bit portion of the block counter never 
>> >> overflows.
>> >>
>> >> Maybe it is only RFC 7539 that matters because that is what is being
>> >> standardized by the IETF; I don't know.  But it confused me.
>> >>
>> >
>> > The distinction only matters if you start the counter at zero (or
>> > one), because you 'lose' 32 bits of IV that will never be != 0 in
>> > practice if you use a 64-bit counter.
>> >
>> > So that argues for not exposing the block counter as part of the API,
>> > given that it should start at zero anyway, and that you should take
>> > care not to put colliding values in it.
>> >
>> >> Anyway, I actually thought it was intentional that the ChaCha 
>> >> implementations in
>> >> the Linux kernel allowed specifying the block counter, and therefore 
>> >> allowed
>> >> seeking to any point in the keystream, exposing the full functionality of 
>> >> the
>> >> cipher.  It's true that it's easily misused though, so there may 
>> >> nevertheless be
>> >> value in providing a nonce-only variant.
>> >>
>> >
>> > Currently, the skcipher API does not allow such random access, so
>> > while I can see how that could be a useful feature, we can't really
>> > make use of it today. But more importantly, it still does not mean the
>> > block counter should be exposed to the /users/ of the skcipher API
>> > which typically encrypt/decrypt blocks that are much larger than 64
>> > bytes.
>>
>> ... but now that I think of it, how is this any different from, say,
>> AES in CTR mode? The counter is big endian, but apart from that, using
>> IVs derived from a counter will result in the exact same issue, only
>> with a shift of 16 bytes.
>>
>> That means using file block numbers as IV is simply inappropriate, and
>> you should encrypt them first like is done for AES-CBC
>
> The problem with using a stream cipher --- whether that is ChaCha20, AES-CTR, 
> or
> something else --- for disk/file encryption is that by necessity of file/disk
> encryption, each time the "same" block is written to, the IV is the same, 
> which
> is really bad for stream ciphers (but not as bad for AES-XTS, AES-CBC, etc.).
> It's irrelevant whether you do ESSIV or otherwise encrypt the IVs.  ESSIV does
> make the IV for each offset unpredictable by an attacker, which is desirable 
> for
> AES-CBC, but it doesn't stop the IV from being repeated for each overwrite.
>

I'm not suggesting using an encrypted IV to fix the stream cipher
issue; I'm well aware that that is impossible. What I am saying is
that the counter collision can be mitigated by encrypting the IV.

> And just to clarify, you definitely *can* seek to any position in the ChaCha20
> stream using the existing ChaCha20 implementations and the existing skcipher
> API, simply by providing the appropriate IV.  Maybe it was unintentional, but 
> it
> does work.  chacha20poly1305.c even uses it to start at block 1 instead of 
> block
> 0.  I don't know whether there are other users, though.
>

Well, I understand that that's how ChaCha20 works, and that you can
manipulate the IV directly to start at another point in the keystream.
AES-CTR can do exactly the same, for the same reason. What I am saying
is that the skcipher API does not allow you to decrypt an arbitrary
part of a block, which could benefit from not having to generate the
entire keystream.

So the more we discuss this, the more I think there is actually no
difference from AES-CTR (apart from the block size), and there is a
similar enhancement in RFC3686 where the IV does not cover the AES
block level counter, making it safe to use another counter to generate
the IVs.

Of course, this is essentially what you did for the fscrypt code, I
just don't like seeing that kind of reasoning being implement in the
crypto API client.


Re: [RFC PATCH] crypto: chacha20 - add implementation using 96-bit nonce

2017-12-08 Thread Ard Biesheuvel
On 8 December 2017 at 22:42, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> On 8 December 2017 at 22:17, Eric Biggers <ebigge...@gmail.com> wrote:
>> On Fri, Dec 08, 2017 at 11:55:02AM +, Ard Biesheuvel wrote:
>>> As pointed out by Eric [0], the way RFC7539 was interpreted when creating
>>> our implementation of ChaCha20 creates a risk of IV reuse when using a
>>> little endian counter as the IV generator. The reason is that the low end
>>> bits of the counter get mapped onto the ChaCha20 block counter, which
>>> advances every 64 bytes. This means that the counter value that gets
>>> selected as IV for the next input block will collide with the ChaCha20
>>> block counter of the previous block, basically recreating the same
>>> keystream but shifted by 64 bytes.
>>>
>>> RFC7539 describes the inputs of the algorithm as follows:
>>>
>>>   The inputs to ChaCha20 are:
>>>
>>>  o  A 256-bit key
>>>
>>>  o  A 32-bit initial counter.  This can be set to any number, but will
>>> usually be zero or one.  It makes sense to use one if we use the
>>> zero block for something else, such as generating a one-time
>>> authenticator key as part of an AEAD algorithm.
>>>
>>>  o  A 96-bit nonce.  In some protocols, this is known as the
>>> Initialization Vector.
>>>
>>>  o  An arbitrary-length plaintext
>>>
>>> The solution is to use a fixed value of 0 for the initial counter, and
>>> only expose a 96-bit IV to the upper layers of the crypto API.
>>>
>>> So introduce a new ChaCha20 flavor called chacha20-iv96, which takes the
>>> above into account, and should become the preferred ChaCha20
>>> implementation going forward for general use.
>>
>> Note that there are two conflicting conventions for what inputs ChaCha takes.
>> The original paper by Daniel Bernstein
>> (https://cr.yp.to/chacha/chacha-20080128.pdf) says that the block counter is
>> 64-bit and the nonce is 64-bit, thereby expanding the key into 2^64 randomly
>> accessible streams, each containing 2^64 randomly accessible 64-byte blocks.
>>
>> The RFC 7539 convention is equivalent to seeking to a large offset 
>> (determined
>> by the first 32 bits of the 96-bit nonce) in the keystream defined by the djb
>> convention, but only if the 32-bit portion of the block counter never 
>> overflows.
>>
>> Maybe it is only RFC 7539 that matters because that is what is being
>> standardized by the IETF; I don't know.  But it confused me.
>>
>
> The distinction only matters if you start the counter at zero (or
> one), because you 'lose' 32 bits of IV that will never be != 0 in
> practice if you use a 64-bit counter.
>
> So that argues for not exposing the block counter as part of the API,
> given that it should start at zero anyway, and that you should take
> care not to put colliding values in it.
>
>> Anyway, I actually thought it was intentional that the ChaCha 
>> implementations in
>> the Linux kernel allowed specifying the block counter, and therefore allowed
>> seeking to any point in the keystream, exposing the full functionality of the
>> cipher.  It's true that it's easily misused though, so there may 
>> nevertheless be
>> value in providing a nonce-only variant.
>>
>
> Currently, the skcipher API does not allow such random access, so
> while I can see how that could be a useful feature, we can't really
> make use of it today. But more importantly, it still does not mean the
> block counter should be exposed to the /users/ of the skcipher API
> which typically encrypt/decrypt blocks that are much larger than 64
> bytes.

... but now that I think of it, how is this any different from, say,
AES in CTR mode? The counter is big endian, but apart from that, using
IVs derived from a counter will result in the exact same issue, only
with a shift of 16 bytes.

That means using file block numbers as IVs is simply inappropriate, and
you should encrypt them first, as is done for AES-CBC.


Re: [RFC PATCH] crypto: chacha20 - add implementation using 96-bit nonce

2017-12-08 Thread Ard Biesheuvel
On 8 December 2017 at 22:17, Eric Biggers <ebigge...@gmail.com> wrote:
> On Fri, Dec 08, 2017 at 11:55:02AM +0000, Ard Biesheuvel wrote:
>> As pointed out by Eric [0], the way RFC7539 was interpreted when creating
>> our implementation of ChaCha20 creates a risk of IV reuse when using a
>> little endian counter as the IV generator. The reason is that the low end
>> bits of the counter get mapped onto the ChaCha20 block counter, which
>> advances every 64 bytes. This means that the counter value that gets
>> selected as IV for the next input block will collide with the ChaCha20
>> block counter of the previous block, basically recreating the same
>> keystream but shifted by 64 bytes.
>>
>> RFC7539 describes the inputs of the algorithm as follows:
>>
>>   The inputs to ChaCha20 are:
>>
>>  o  A 256-bit key
>>
>>  o  A 32-bit initial counter.  This can be set to any number, but will
>> usually be zero or one.  It makes sense to use one if we use the
>> zero block for something else, such as generating a one-time
>> authenticator key as part of an AEAD algorithm.
>>
>>  o  A 96-bit nonce.  In some protocols, this is known as the
>> Initialization Vector.
>>
>>  o  An arbitrary-length plaintext
>>
>> The solution is to use a fixed value of 0 for the initial counter, and
>> only expose a 96-bit IV to the upper layers of the crypto API.
>>
>> So introduce a new ChaCha20 flavor called chacha20-iv96, which takes the
>> above into account, and should become the preferred ChaCha20
>> implementation going forward for general use.
>
> Note that there are two conflicting conventions for what inputs ChaCha takes.
> The original paper by Daniel Bernstein
> (https://cr.yp.to/chacha/chacha-20080128.pdf) says that the block counter is
> 64-bit and the nonce is 64-bit, thereby expanding the key into 2^64 randomly
> accessible streams, each containing 2^64 randomly accessible 64-byte blocks.
>
> The RFC 7539 convention is equivalent to seeking to a large offset (determined
> by the first 32 bits of the 96-bit nonce) in the keystream defined by the djb
> convention, but only if the 32-bit portion of the block counter never 
> overflows.
>
> Maybe it is only RFC 7539 that matters because that is what is being
> standardized by the IETF; I don't know.  But it confused me.
>

The distinction only matters if you start the counter at zero (or
one), because you 'lose' 32 bits of IV that will never be != 0 in
practice if you use a 64-bit counter.

So that argues for not exposing the block counter as part of the API,
given that it should start at zero anyway, and that you should take
care not to put colliding values in it.

> Anyway, I actually thought it was intentional that the ChaCha implementations 
> in
> the Linux kernel allowed specifying the block counter, and therefore allowed
> seeking to any point in the keystream, exposing the full functionality of the
> cipher.  It's true that it's easily misused though, so there may nevertheless 
> be
> value in providing a nonce-only variant.
>

Currently, the skcipher API does not allow such random access, so
while I can see how that could be a useful feature, we can't really
make use of it today. But more importantly, it still does not mean the
block counter should be exposed to the /users/ of the skcipher API
which typically encrypt/decrypt blocks that are much larger than 64
bytes.
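
To make the difference between the two conventions concrete: in both
cases words 0-3 of the 16-word ChaCha20 state hold the "expand 32-byte k"
constants and words 4-11 hold the 256-bit key; only the interpretation of
words 12-15 differs. A rough standalone sketch (not the kernel's
chacha20_generic code, and with byte order handling glossed over):

	#include <stdint.h>
	#include <string.h>

	/* djb convention: 64-bit block counter in words 12-13, 64-bit nonce in 14-15 */
	static void chacha20_iv_djb(uint32_t state[16], uint64_t counter,
				    const uint8_t nonce[8])
	{
		state[12] = (uint32_t)counter;
		state[13] = (uint32_t)(counter >> 32);
		memcpy(&state[14], nonce, 8);
	}

	/* RFC 7539 convention: 32-bit block counter in word 12, 96-bit nonce in 13-15 */
	static void chacha20_iv_rfc7539(uint32_t state[16], uint32_t counter,
					const uint8_t nonce[12])
	{
		state[12] = counter;
		memcpy(&state[13], nonce, 12);
	}

So an IV that is generated from a little endian counter and copied
straight into words 12-15 puts the low 32 bits of that counter on top of
the block counter, which is what produces the 64-byte-shifted keystream
overlap described earlier in this thread.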


<    1   2   3   4   5   6   7   8   9   >