RE: [PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-26 Thread David Laight
From: Danny Tsen
> Sent: 24 April 2023 19:47
> 
> Improve overall performance of chacha20 encrypt and decrypt operations
> for Power10 or later CPU.
> 
> Signed-off-by: Danny Tsen 
> ---
>  arch/powerpc/crypto/chacha-p10le-8x.S | 842 ++
>  1 file changed, 842 insertions(+)
>  create mode 100644 arch/powerpc/crypto/chacha-p10le-8x.S
...
> +.macro QT_loop_8x
> +   # QR(v0, v4,  v8, v12, v1, v5,  v9, v13, v2, v6, v10, v14, v3, v7, v11, v15)
> +   xxlor   0, 32+25, 32+25
> +   xxlor   32+25, 20, 20
> +   vadduwm 0, 0, 4
> +   vadduwm 1, 1, 5
> +   vadduwm 2, 2, 6
> +   vadduwm 3, 3, 7
> +   vadduwm 16, 16, 20
> +   vadduwm 17, 17, 21
> +   vadduwm 18, 18, 22
> +   vadduwm 19, 19, 23
> +
> +   vpermxor 12, 12, 0, 25
> +   vpermxor 13, 13, 1, 25
> +   vpermxor 14, 14, 2, 25
> +   vpermxor 15, 15, 3, 25
> +   vpermxor 28, 28, 16, 25
> +   vpermxor 29, 29, 17, 25
> +   vpermxor 30, 30, 18, 25
> +   vpermxor 31, 31, 19, 25
> +   xxlor   32+25, 0, 0
> +   vadduwm 8, 8, 12
> +   vadduwm 9, 9, 13
> +   vadduwm 10, 10, 14
> +   vadduwm 11, 11, 15
...

Is it just me or is all this code just complete gibberish?

There really ought to be enough comments so that it is possible
to check that the code is doing something that looks like chacha20
without spending all day tracking register numbers through
hundreds of lines of assembler.

I also wonder how much faster the 8-way unroll actually is.
On a modern CPU with 'out of order' execution (etc.) it is
not impossible to get the loop operations 'for free',
because they use execution units that are otherwise idle.

Massive loop unrolling is so 1980's.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



Re: [PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-25 Thread Danny Tsen

Hi Michael,

It's in the IBM repo.

Thanks.

-Danny

On 4/25/23 7:02 AM, Michael Ellerman wrote:

> Danny Tsen  writes:
>> This is the recommended template to use for IBM copyright.
>
> According to who?
>
> The documentation I've seen specifies "IBM Corp." or "IBM Corporation".
>
> cheers


Re: [PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-25 Thread Michael Ellerman
Danny Tsen  writes:
> This is the recommended template to use for IBM copyright.

According to who?

The documentation I've seen specifies "IBM Corp." or "IBM Corporation".

cheers


Re: [PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-24 Thread Danny Tsen

This is the recommended template to use for IBM copyright.

Thanks.

-Danny

On 4/24/23 3:40 PM, Elliott, Robert (Servers) wrote:

>> +# Copyright 2023- IBM Inc. All rights reserved
>
> I don't think any such entity exists - you probably mean IBM Corporation.


RE: [PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-24 Thread Elliott, Robert (Servers)
> +# Copyright 2023- IBM Inc. All rights reserved

I don't think any such entity exists - you probably mean IBM Corporation.


[PATCH 1/5] An optimized Chacha20 implementation with 8-way unrolling for ppc64le.

2023-04-24 Thread Danny Tsen
Improve overall performance of chacha20 encrypt and decrypt operations
for Power10 or later CPU.

Signed-off-by: Danny Tsen 
---
 arch/powerpc/crypto/chacha-p10le-8x.S | 842 ++
 1 file changed, 842 insertions(+)
 create mode 100644 arch/powerpc/crypto/chacha-p10le-8x.S

diff --git a/arch/powerpc/crypto/chacha-p10le-8x.S b/arch/powerpc/crypto/chacha-p10le-8x.S
new file mode 100644
index ..7c15d17101d7
--- /dev/null
+++ b/arch/powerpc/crypto/chacha-p10le-8x.S
@@ -0,0 +1,842 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#
+# Accelerated chacha20 implementation for ppc64le.
+#
+# Copyright 2023- IBM Inc. All rights reserved
+#
+#===
+# Written by Danny Tsen 
+#
+# chacha_p10le_8x(u32 *state, byte *dst, const byte *src,
+#   size_t len, int nrounds);
+#
+# Round loop: 8 quarter rounds per iteration; one quarter round (QR) is:
+# 1.  a += b; d ^= a; d <<<= 16;
+# 2.  c += d; b ^= c; b <<<= 12;
+# 3.  a += b; d ^= a; d <<<= 8;
+# 4.  c += d; b ^= c; b <<<= 7
+#
+# row1 = (row1 + row2),  row4 = row1 xor row4,  row4 rotate each word by 16
+# row3 = (row3 + row4),  row2 = row3 xor row2,  row2 rotate each word by 12
+# row1 = (row1 + row2), row4 = row1 xor row4,  row4 rotate each word by 8
+# row3 = (row3 + row4), row2 = row3 xor row2,  row2 rotate each word by 7
+#
+# 4 blocks (a b c d)
+#
+# a0 b0 c0 d0
+# a1 b1 c1 d1
+# ...
+# a4 b4 c4 d4
+# ...
+# a8 b8 c8 d8
+# ...
+# a12 b12 c12 d12
+# a13 ...
+# a14 ...
+# a15 b15 c15 d15
+#
+# Column round   (v0, v4,  v8, v12, v1, v5,  v9, v13, v2, v6, v10, v14, v3, v7, v11, v15)
+# Diagonal round (v0, v5, v10, v15, v1, v6, v11, v12, v2, v7,  v8, v13, v3, v4,  v9, v14)
+#
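+# For reference, one double round in illustrative C (not part of this file;
+# x[] stands for the 16-word ChaCha state and rol32() for a 32-bit left
+# rotate, as in linux/bitops.h):
+#
+#   static inline void qr(u32 *a, u32 *b, u32 *c, u32 *d)
+#   {
+#  *a += *b; *d ^= *a; *d = rol32(*d, 16);
+#  *c += *d; *b ^= *c; *b = rol32(*b, 12);
+#  *a += *b; *d ^= *a; *d = rol32(*d, 8);
+#  *c += *d; *b ^= *c; *b = rol32(*b, 7);
+#   }
+#
+#   /* column round */
+#   qr(&x[0], &x[4], &x[8],  &x[12]); qr(&x[1], &x[5], &x[9],  &x[13]);
+#   qr(&x[2], &x[6], &x[10], &x[14]); qr(&x[3], &x[7], &x[11], &x[15]);
+#   /* diagonal round */
+#   qr(&x[0], &x[5], &x[10], &x[15]); qr(&x[1], &x[6], &x[11], &x[12]);
+#   qr(&x[2], &x[7], &x[8],  &x[13]); qr(&x[3], &x[4], &x[9],  &x[14]);
+#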
+
+#include 
+#include 
+#include 
+#include 
+
+.machine   "any"
+.text
+
+.macro SAVE_GPR GPR OFFSET FRAME
+   std \GPR,\OFFSET(\FRAME)
+.endm
+
+.macro SAVE_VRS VRS OFFSET FRAME
+   li  16, \OFFSET
+   stvx\VRS, 16, \FRAME
+.endm
+
+.macro SAVE_VSX VSX OFFSET FRAME
+   li  16, \OFFSET
+   stxvx   \VSX, 16, \FRAME
+.endm
+
+.macro RESTORE_GPR GPR OFFSET FRAME
+   ld  \GPR,\OFFSET(\FRAME)
+.endm
+
+.macro RESTORE_VRS VRS OFFSET FRAME
+   li  16, \OFFSET
+   lvx \VRS, 16, \FRAME
+.endm
+
+.macro RESTORE_VSX VSX OFFSET FRAME
+   li  16, \OFFSET
+   lxvx\VSX, 16, \FRAME
+.endm
+
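+# SAVE_REGS: save LR, allocate a 752-byte stack frame, then spill the
+# non-volatile GPRs r14-r31 at offsets 112-248 and the vector registers
+# v20-v31 and vs14-vs31 into the save area starting at sp+256 (r9 = base).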
+.macro SAVE_REGS
+   mflr 0
+   std 0, 16(1)
+   stdu 1,-752(1)
+
+   SAVE_GPR 14, 112, 1
+   SAVE_GPR 15, 120, 1
+   SAVE_GPR 16, 128, 1
+   SAVE_GPR 17, 136, 1
+   SAVE_GPR 18, 144, 1
+   SAVE_GPR 19, 152, 1
+   SAVE_GPR 20, 160, 1
+   SAVE_GPR 21, 168, 1
+   SAVE_GPR 22, 176, 1
+   SAVE_GPR 23, 184, 1
+   SAVE_GPR 24, 192, 1
+   SAVE_GPR 25, 200, 1
+   SAVE_GPR 26, 208, 1
+   SAVE_GPR 27, 216, 1
+   SAVE_GPR 28, 224, 1
+   SAVE_GPR 29, 232, 1
+   SAVE_GPR 30, 240, 1
+   SAVE_GPR 31, 248, 1
+
+   addi9, 1, 256
+   SAVE_VRS 20, 0, 9
+   SAVE_VRS 21, 16, 9
+   SAVE_VRS 22, 32, 9
+   SAVE_VRS 23, 48, 9
+   SAVE_VRS 24, 64, 9
+   SAVE_VRS 25, 80, 9
+   SAVE_VRS 26, 96, 9
+   SAVE_VRS 27, 112, 9
+   SAVE_VRS 28, 128, 9
+   SAVE_VRS 29, 144, 9
+   SAVE_VRS 30, 160, 9
+   SAVE_VRS 31, 176, 9
+
+   SAVE_VSX 14, 192, 9
+   SAVE_VSX 15, 208, 9
+   SAVE_VSX 16, 224, 9
+   SAVE_VSX 17, 240, 9
+   SAVE_VSX 18, 256, 9
+   SAVE_VSX 19, 272, 9
+   SAVE_VSX 20, 288, 9
+   SAVE_VSX 21, 304, 9
+   SAVE_VSX 22, 320, 9
+   SAVE_VSX 23, 336, 9
+   SAVE_VSX 24, 352, 9
+   SAVE_VSX 25, 368, 9
+   SAVE_VSX 26, 384, 9
+   SAVE_VSX 27, 400, 9
+   SAVE_VSX 28, 416, 9
+   SAVE_VSX 29, 432, 9
+   SAVE_VSX 30, 448, 9
+   SAVE_VSX 31, 464, 9
+.endm # SAVE_REGS
+
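+# RESTORE_REGS: reload v20-v31, vs14-vs31 and r14-r31 from the offsets
+# used by SAVE_REGS above.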
+.macro RESTORE_REGS
+   addi9, 1, 256
+   RESTORE_VRS 20, 0, 9
+   RESTORE_VRS 21, 16, 9
+   RESTORE_VRS 22, 32, 9
+   RESTORE_VRS 23, 48, 9
+   RESTORE_VRS 24, 64, 9
+   RESTORE_VRS 25, 80, 9
+   RESTORE_VRS 26, 96, 9
+   RESTORE_VRS 27, 112, 9
+   RESTORE_VRS 28, 128, 9
+   RESTORE_VRS 29, 144, 9
+   RESTORE_VRS 30, 160, 9
+   RESTORE_VRS 31, 176, 9
+
+   RESTORE_VSX 14, 192, 9
+   RESTORE_VSX 15, 208, 9
+   RESTORE_VSX 16, 224, 9
+   RESTORE_VSX 17, 240, 9
+   RESTORE_VSX 18, 256, 9
+   RESTORE_VSX 19, 272, 9
+   RESTORE_VSX 20, 288, 9
+   RESTORE_VSX 21, 304, 9
+   RESTORE_VSX 22, 320, 9
+   RESTORE_VSX 23, 336, 9
+   RESTORE_VSX 24, 352, 9
+   RESTORE_VSX 25, 368, 9
+   RESTORE_VSX 26, 384, 9
+   RESTORE_VSX 27, 400, 9
+   RESTORE_VSX 28, 416, 9
+   RESTORE_VSX 29, 432, 9
+   RESTORE_VSX 30, 448, 9
+   RESTORE_VSX 31, 464, 9
+
+   RESTORE_GPR 14, 112, 1
+   RESTORE_GPR 15, 120, 1
+   RESTORE_GPR 16, 128, 1
+   RESTORE_GPR 17, 136, 1
+   RESTORE_GPR 18, 144, 1
+