Re: [Aarch64] Optimize SHA1 Compress

Maamoun TK Sun, 23 May 2021 02:38:43 -0700

On Sun, May 23, 2021 at 10:52 AM Niels Möller <ni...@lysator.liu.se> wrote:


> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > This patch optimizes SHA1 compress function for arm64 architecture by
> > taking advantage of SHA-1 instructions of Armv8 crypto extension.
> > The SHA-1 instructions:
> > SHA1C: SHA1 hash update (choose)
> > SHA1H: SHA1 fixed rotate
> > SHA1M: SHA1 hash update (majority)
> > SHA1P: SHA1 hash update (parity)
> > SHA1SU0: SHA1 schedule update 0
> > SHA1SU1: SHA1 schedule update 1
>
> Can you add this brief summary of instructions as a comment in the asm
> file?
>

Done! I'll attach a patch at the end of the message that performs slightly
better as well.

         Algorithm         mode Mbyte/s
              sha1       update  800.80
      openssl sha1       update  849.17
         hmac-sha1     64 bytes  166.10
         hmac-sha1    256 bytes  409.24
         hmac-sha1   1024 bytes  636.98
         hmac-sha1   4096 bytes  739.20
         hmac-sha1   single msg  775.67

> > Benchmark on gcc117 instance of CFarm before applying the patch:
> >          Algorithm         mode Mbyte/s
> >               sha1       update  214.16
> >       openssl sha1       update  849.44
>
> > Benchmark on gcc117 instance of CFarm after applying the patch:
> >          Algorithm         mode Mbyte/s
> >               sha1       update  795.57
> >       openssl sha1       update  849.25
>
> Great speedup! Any idea why openssl is still slightly faster?
>

Sure. The OpenSSL implementation uses a loop inside the SHA1 update function,
which eliminates the constant initialization and state loading/storing for
each block, while nettle does that on every block iteration.
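
The structural difference can be modelled roughly as follows. This is a Python sketch, not nettle's or OpenSSL's code: the names rounds, compress, compress_n and pad are hypothetical, and the model shows only where the state is loaded and stored, not the actual cost in assembly.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)

def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def rounds(a, b, c, d, e, block):
    """80 SHA-1 rounds on one 64-byte block, including the final state addition."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a0, b0, c0, d0, e0 = a, b, c, d, e
    for t in range(80):
        if t < 20:
            f = (b & c) | (~b & d)            # choose
        elif 40 <= t < 60:
            f = (b & c) | (b & d) | (c & d)   # majority
        else:
            f = b ^ c ^ d                     # parity
        a, b, c, d, e = (rol(a, 5) + f + e + K[t // 20] + w[t]) & MASK, a, rol(b, 30), c, d
    return tuple((x + y) & MASK for x, y in zip((a0, b0, c0, d0, e0), (a, b, c, d, e)))

def compress(state, block):
    """Per-block interface: the state is loaded and stored on every call."""
    state[:] = rounds(*state, block)

def compress_n(state, data):
    """Multi-block interface: the state stays in locals across the loop."""
    a, b, c, d, e = state                     # load once
    for off in range(0, len(data), 64):
        a, b, c, d, e = rounds(a, b, c, d, e, data[off:off + 64])
    state[:] = (a, b, c, d, e)                # store once

def pad(msg):
    """Standard SHA-1 padding, so the result can be checked against hashlib."""
    bitlen = 8 * len(msg)
    msg += b"\x80"
    msg += b"\x00" * (-(len(msg) + 8) % 64)
    return msg + struct.pack(">Q", bitlen)

msg = b"a" * 200
data = pad(msg)
per_block = list(IV)
for off in range(0, len(data), 64):
    compress(per_block, data[off:off + 64])
looped = list(IV)
compress_n(looped, data)
assert per_block == looped
assert struct.pack(">5I", *looped) == hashlib.sha1(msg).digest()
```

Both paths give the same digest; the looped variant just hoists the setup and the state load/store out of the per-block path.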


> > +define(`TMP0', `v21')
> > +define(`TMP1', `v22')
>
> Not sure I understand how these are used, but it looks like the TMP
> variables are used in some way for the message expansion state? E.g.,
> TMP0 assigned in the code for rounds 0-3, and this value used in the
> code for rounds 8-11. Other implementations don't need extra state for
> this, but just modify the 16 message words in-place.
>

Modifying the message words in-place will change the value used by
'sha1su0' and 'sha1su1' instructions. According to ARM® A64 Instruction Set
Architecture:
SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
<Vd> Is the name of the SIMD&FP source and destination register
.
.

SHA1SU1 <Vd>.4S, <Vn>.4S
<Vd> Is the name of the SIMD&FP source and destination register
.
.

So using a TMP variable is necessary here. I can't think of any replacement;
let me know how the other implementations handle this case.
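
To make the "source and destination" point concrete, here is a small Python model of the two schedule instructions. It is a sketch based on my reading of the instruction pseudocode, with lane 0 taken as the least-significant 32-bit element (an assumption of the model). The function names are hypothetical, and sha1su0/sha1su1 are fused into one step per group, whereas the patch interleaves them across adjacent round groups.

```python
MASK = 0xFFFFFFFF

def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1su0(vd, vn, vm):
    """Model of SHA1SU0 Vd, Vn, Vm: returns the new contents of Vd."""
    mixed = vd[2:] + vn[:2]          # upper half of Vd, then lower half of Vn
    return [m ^ d ^ x for m, d, x in zip(mixed, vd, vm)]

def sha1su1(vd, vn):
    """Model of SHA1SU1 Vd, Vn: returns the new contents of Vd."""
    t = [d ^ n for d, n in zip(vd, vn[1:] + [0])]
    return [rol(t[0], 1), rol(t[1], 1), rol(t[2], 1), rol(t[3], 1) ^ rol(t[0], 2)]

def schedule_reference(w16):
    """Textbook message schedule: W[t] = ROL1(W[t-3]^W[t-8]^W[t-14]^W[t-16])."""
    w = list(w16)
    for t in range(16, 80):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def schedule_in_place(w16):
    """Produce W[0..79] four words at a time while keeping only four vectors."""
    msg = [list(w16[i:i + 4]) for i in range(0, 16, 4)]
    out = []
    for group in range(20):
        cur = msg[group % 4]
        # The words must be consumed here (the add with the round constant)
        # before the destination register is overwritten below.
        out.extend(cur)
        if group < 16:
            n1, n2, n3 = (msg[(group + k) % 4] for k in (1, 2, 3))
            msg[group % 4] = sha1su1(sha1su0(cur, n1, n2), n3)
    return out

w16 = [(i * 0x9E3779B9 + 1) & MASK for i in range(16)]
assert schedule_in_place(w16) == schedule_reference(w16)
```

In this model the old contents of a MSG vector are gone as soon as the schedule update writes it, so anything that still needs W[t] + K has to read it first.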

> It would be nice to either make the TMP registers more temporary (i.e.,
> no round depends on the value in these registers from previous rounds)
> and keep needed state only on the MSG variables. Or rename them to give
> a better hint on how they're used.
>

Done! This also yields a slight performance increase.


> > +C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
> > +
> > +PROLOGUE(nettle_sha1_compress)
> > +    C Initialize constants
> > +    mov            w2,#0x7999
> > +    movk           w2,#0x5A82,lsl #16
> > +    dup            CONST0.4s,w2
> > +    mov            w2,#0xEBA1
> > +    movk           w2,#0x6ED9,lsl #16
> > +    dup            CONST1.4s,w2
> > +    mov            w2,#0xBCDC
> > +    movk           w2,#0x8F1B,lsl #16
> > +    dup            CONST2.4s,w2
> > +    mov            w2,#0xC1D6
> > +    movk           w2,#0xCA62,lsl #16
> > +    dup            CONST3.4s,w2
>
> Maybe it would be clearer or more efficient to load these from memory? Not
> sure if there's a nice and concise way to load the four 32-bit values
> into a 128-bit register, and then copy/duplicate them into the four const
> registers.
>

We can load all the constants (including duplicate values) from memory with
one instruction. The issue is how to get the data address properly for
every supported ABI! So far I have only seen solutions with multiple paths
for different ABIs, which I don't really like. The easiest solution is to
define the data in the .text section to make sure the address is near enough
to be loaded with a certain instruction. Do you want to do that?
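
For reference, a sketch in Python of what such a table would hold. It rebuilds the four constants from the mov/movk immediates in the patch and replicates each into four 32-bit lanes, as dup does. The 64-byte layout, the little-endian packing and the names mov_movk, dup4 and TABLE are assumptions for illustration, not something the patch defines.

```python
import struct

# (low, high) 16-bit immediates exactly as they appear in the mov/movk pairs.
IMMEDIATES = [(0x7999, 0x5A82), (0xEBA1, 0x6ED9), (0xBCDC, 0x8F1B), (0xC1D6, 0xCA62)]

def mov_movk(low, high):
    """mov w2,#low followed by movk w2,#high,lsl #16."""
    return (high << 16) | low

def dup4(k):
    """dup Vd.4s,w2: one 32-bit value replicated into four lanes."""
    return struct.pack("<4I", k, k, k, k)

K = [mov_movk(lo, hi) for lo, hi in IMMEDIATES]

# Hypothetical in-memory table: CONST0..CONST3 back to back, 16 bytes each,
# so that a single four-register load could fill all of them.
TABLE = b"".join(dup4(k) for k in K)

assert K == [0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6]
assert len(TABLE) == 64
```

This only settles the contents of the table; how to obtain its address portably is the open question.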


> > +    C Load message
> > +    ld1            {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
> > +
> > +    C Reverse for little endian
> > +    rev32          MSG0.16b,MSG0.16b
> > +    rev32          MSG1.16b,MSG1.16b
> > +    rev32          MSG2.16b,MSG2.16b
> > +    rev32          MSG3.16b,MSG3.16b
>
> How does this work on big-endian? The ld1 with .16b is endian-neutral
> (according to the README), that means we always get the wrong order, and
> then we do unconditional byteswapping? Maybe add a comment. Not sure if
> it's worth the effort to make it work differently (ld1 .4w on
> big-endian)? It's going to be a pretty small fraction of the per-block
> processing.
>

We had an intensive discussion about that for the GCM patch. In short, this
patch should work well in both endianness modes. However, it does not handle
the endianness variation the same way the GCM patch does; to follow the GCM
patch's approach we can do:

    C Load message
    ld1            {MSG0.4s,MSG1.4s,MSG2.4s,MSG3.4s},[INPUT]

    C Reverse for little endian
IF_LE(`
    rev32          MSG0.16b,MSG0.16b
    rev32          MSG1.16b,MSG1.16b
    rev32          MSG2.16b,MSG2.16b
    rev32          MSG3.16b,MSG3.16b
')
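
A small Python model of this variant, as a sanity check of the idea rather than of the assembly. It assumes ld1 with .4s elements reads each 32-bit lane in the host byte order. The function names are hypothetical, and the .16b load with unconditional rev32 is not modelled here.

```python
import struct

def ld1_4s(block, byteorder):
    """Model of ld1 {MSG0.4s-MSG3.4s}: sixteen 32-bit lanes in host byte order."""
    fmt = "<16I" if byteorder == "little" else ">16I"
    return list(struct.unpack(fmt, block))

def rev32(words):
    """Model of rev32 Vd.16b,Vn.16b: reverse the bytes within each 32-bit lane."""
    return [int.from_bytes(w.to_bytes(4, "little"), "big") for w in words]

def load_message(block, byteorder):
    words = ld1_4s(block, byteorder)
    if byteorder == "little":        # the IF_LE(` ... ') part
        words = rev32(words)
    return words

block = bytes(range(64))
expected = list(struct.unpack(">16I", block))   # SHA-1 reads big-endian words
assert load_message(block, "little") == expected
assert load_message(block, "big") == expected
```

Either way the rounds see the big-endian words that SHA-1 expects, and the byte swap is only paid on little-endian.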

regards,
Mamone

---
 arm64/crypto/sha1-compress.asm | 93 +++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
index f261c93d..9f7d9f37 100644
--- a/arm64/crypto/sha1-compress.asm
+++ b/arm64/crypto/sha1-compress.asm
@@ -30,6 +30,15 @@ ifelse(`
    not, see http://www.gnu.org/licenses/.
 ')

+C This implementation uses the SHA-1 instructions of Armv8 crypto
+C extension.
+C SHA1C: SHA1 hash update (choose)
+C SHA1H: SHA1 fixed rotate
+C SHA1M: SHA1 hash update (majority)
+C SHA1P: SHA1 hash update (parity)
+C SHA1SU0: SHA1 schedule update 0
+C SHA1SU1: SHA1 schedule update 1
+
 .file "sha1-compress.asm"
 .arch armv8-a+crypto

@@ -53,8 +62,7 @@ define(`ABCD_SAVED', `v17')
 define(`E0', `v18')
 define(`E0_SAVED', `v19')
 define(`E1', `v20')
-define(`TMP0', `v21')
-define(`TMP1', `v22')
+define(`TMP', `v21')

 C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)

@@ -92,140 +100,139 @@ PROLOGUE(nettle_sha1_compress)
     rev32          MSG2.16b,MSG2.16b
     rev32          MSG3.16b,MSG3.16b

-    add            TMP0.4s,MSG0.4s,CONST0.4s
-    add            TMP1.4s,MSG1.4s,CONST0.4s
-
     C Rounds 0-3
+    add            TMP.4s,MSG0.4s,CONST0.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1c          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG2.4s,CONST0.4s
+    sha1c          QFP(ABCD),SFP(E0),TMP.4s
     sha1su0        MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 4-7
+    add            TMP.4s,MSG1.4s,CONST0.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1c          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG3.4s,CONST0.4s
+    sha1c          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG0.4s,MSG3.4s
     sha1su0        MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 8-11
+    add            TMP.4s,MSG2.4s,CONST0.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1c          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG0.4s,CONST0.4s
+    sha1c          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG1.4s,MSG0.4s
     sha1su0        MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 12-15
+    add            TMP.4s,MSG3.4s,CONST0.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1c          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG1.4s,CONST1.4s
+    sha1c          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG2.4s,MSG1.4s
     sha1su0        MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 16-19
+    add            TMP.4s,MSG0.4s,CONST0.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1c          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG2.4s,CONST1.4s
+    sha1c          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG3.4s,MSG2.4s
     sha1su0        MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 20-23
+    add            TMP.4s,MSG1.4s,CONST1.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG3.4s,CONST1.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG0.4s,MSG3.4s
     sha1su0        MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 24-27
+    add            TMP.4s,MSG2.4s,CONST1.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG0.4s,CONST1.4s
+    sha1p          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG1.4s,MSG0.4s
     sha1su0        MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 28-31
+    add            TMP.4s,MSG3.4s,CONST1.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG1.4s,CONST1.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG2.4s,MSG1.4s
     sha1su0        MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 32-35
+    add            TMP.4s,MSG0.4s,CONST1.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG2.4s,CONST2.4s
+    sha1p          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG3.4s,MSG2.4s
     sha1su0        MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 36-39
+    add            TMP.4s,MSG1.4s,CONST1.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG3.4s,CONST2.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG0.4s,MSG3.4s
     sha1su0        MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 40-43
+    add            TMP.4s,MSG2.4s,CONST2.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1m          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG0.4s,CONST2.4s
+    sha1m          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG1.4s,MSG0.4s
     sha1su0        MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 44-47
+    add            TMP.4s,MSG3.4s,CONST2.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1m          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG1.4s,CONST2.4s
+    sha1m          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG2.4s,MSG1.4s
     sha1su0        MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 48-51
+    add            TMP.4s,MSG0.4s,CONST2.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1m          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG2.4s,CONST2.4s
+    sha1m          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG3.4s,MSG2.4s
     sha1su0        MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 52-55
+    add            TMP.4s,MSG1.4s,CONST2.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1m          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG3.4s,CONST3.4s
+    sha1m          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG0.4s,MSG3.4s
     sha1su0        MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 56-59
+    add            TMP.4s,MSG2.4s,CONST2.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1m          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG0.4s,CONST3.4s
+    sha1m          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG1.4s,MSG0.4s
     sha1su0        MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 60-63
+    add            TMP.4s,MSG3.4s,CONST3.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG1.4s,CONST3.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG2.4s,MSG1.4s
     sha1su0        MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 64-67
+    add            TMP.4s,MSG0.4s,CONST3.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E0),TMP0.4s
-    add            TMP0.4s,MSG2.4s,CONST3.4s
+    sha1p          QFP(ABCD),SFP(E0),TMP.4s
     sha1su1        MSG3.4s,MSG2.4s
     sha1su0        MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 68-71
+    add            TMP.4s,MSG1.4s,CONST3.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
-    add            TMP1.4s,MSG3.4s,CONST3.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s
     sha1su1        MSG0.4s,MSG3.4s

     C Rounds 72-75
+    add            TMP.4s,MSG2.4s,CONST3.4s
     sha1h          SFP(E1),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E0),TMP0.4s
+    sha1p          QFP(ABCD),SFP(E0),TMP.4s

     C Rounds 76-79
+    add            TMP.4s,MSG3.4s,CONST3.4s
     sha1h          SFP(E0),SFP(ABCD)
-    sha1p          QFP(ABCD),SFP(E1),TMP1.4s
+    sha1p          QFP(ABCD),SFP(E1),TMP.4s

     C Combine state
     add            E0.4s,E0.4s,E0_SAVED.4s

-- 
2.25.1
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
