Re: [AArch64] Optimize GHASH

Maamoun TK Fri, 22 Jan 2021 12:15:05 -0800

On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser <michael.wei...@gmx.de>
wrote:


> Longer story: ldr does a 128bit load. This loads bytes in exactly
> reverse order into the register on LE and BE. As you describe above, the
> macros for LE expect a state which is neither of those: The bytes
> transposed as if BE but the doublewords as loaded on LE. For BE this
> poses the opposite problem: It natively loads bytes in the order LE has
> to reproduce using rev64 but then needs to reproduce the doubleword
> order of LE for the LE routines to work or basically have native BE
> routines.
>
> That's what my last pedestrian change did. After today I'd perhaps write
> it like this (untested):
>
> @@ -125,10 +135,12 @@ IF_BE(`
>
>  PROLOGUE(_nettle_gcm_init_key)
>      ldr            HQ,[TABLE,#16*H_Idx]
> -    dup            EMSB.16b,H.b[0]
>  IF_LE(`
>      rev64          H.16b,H.16b
> +',`
> +    ext            H.16b,H.16b,H.16b,#8
>  ')
> +    dup            EMSB.16b,H.b[7]
>      mov            x1,#0xC200000000000000
>      mov            x2,#1
>      mov            POLY.d[0],x1
>
> When trying to cater to the current layout on LE, all the other vectors
> which are later used in conjunction with H need to be reversed as well. That
> leads to this diff to your initial patch:
>
> @@ -125,14 +135,21 @@ IF_BE(`
>
>  PROLOGUE(_nettle_gcm_init_key)
>      ldr            HQ,[TABLE,#16*H_Idx]
> -    dup            EMSB.16b,H.b[0]
>  IF_LE(`
> +    dup            EMSB.16b,H.b[0]
>      rev64          H.16b,H.16b
> +',`
> +    dup            EMSB.16b,H.b[15]
>  ')
>      mov            x1,#0xC200000000000000
>      mov            x2,#1
> +IF_LE(`
>      mov            POLY.d[0],x1
>      mov            POLY.d[1],x2
> +',`
> +    mov            POLY.d[1],x1
> +    mov            POLY.d[0],x2
> +')
>      sshr           EMSB.16b,EMSB.16b,#7
>      and            EMSB.16b,EMSB.16b,POLY.16b
>      ushr           B.2d,H.2d,#63
> @@ -142,7 +159,11 @@ IF_LE(`
>      orr            H.16b,H.16b,B.16b
>      eor            H.16b,H.16b,EMSB.16b
>
> +IF_LE(`
>      dup            POLY.2d,POLY.d[0]
> +',`
> +    dup            POLY.2d,POLY.d[1]
> +')
>
>      C --- calculate H^2 = H*H ---
>
> The difference in index in dup EMSB nicely shows the doubleword
> transposition compared to LE. If on LE the dup was done after the rev64,
> it'd be H.b[7] vs. H.b[15].
>
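
To check my reading of the first diff, here is a minimal Python sketch that
models a vector register as 16 byte lanes (index i stands for V.b[i]) and
labels memory bytes by their offset from the load address. The helper names
ldr_q, rev64 and ext8 are made up for this illustration; they only mimic the
lane movement of the corresponding instructions and are not tested against
hardware.

```python
# Toy model: a 128-bit register is a list of 16 lanes, lanes[i] == V.b[i].
# Memory bytes are labelled 0..15 by their offset from the load address.

def ldr_q(mem, big_endian):
    """ldr Qn,[addr]: one 128-bit load, so lane order flips with endianness."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64(v):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each doubleword."""
    return v[7::-1] + v[15:7:-1]

def ext8(v):
    """ext Vd.16b,Vn.16b,Vn.16b,#8: swap the two doublewords."""
    return v[8:] + v[:8]

mem = list(range(16))

le = rev64(ldr_q(mem, big_endian=False))  # IF_LE branch of the first diff
be = ext8(ldr_q(mem, big_endian=True))    # BE branch of the first diff

print("LE after rev64:", le)
print("BE after ext  :", be)
```

In this model both branches end with the same lane layout, and memory byte 0
sits in lane 7, which is why a single "dup EMSB.16b,H.b[7]" after the
conditional block can serve both modes. Without the ext, memory byte 0 is in
lane 15 on BE, matching the H.b[15] in the second diff.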

I see what you did here, but I'm unsure about the ld1 and st1 instructions,
so let me clarify one thing before going on: how do ld1 and st1 load from
and store to memory in BE mode? If they behave the normal way, there is no
point in using ldr at all; I only used it because it accepts an immediate
offset. To replace the line "ldr HQ,[TABLE,#16*H_Idx]" we can add the
offset to a register holding the address, "add x1,TABLE,#16*H_Idx", and
then load H with ld1, "ld1 {H.16b},[x1]". That way we still deal with
transposed doublewords on LE, and on BE we get the normal layout (neither
doublewords nor the quadword transposed).
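
For reference, this is how I would expect ld1 to behave, written as the same
kind of toy lane model in Python. It assumes the usual description of ld1:
elements are placed in increasing lane order from increasing addresses, and
endianness only affects the bytes inside each element. The helper names
ld1_16b, ld1_2d and rev64 are hypothetical and only model lane movement;
nothing here has been run on BE hardware.

```python
# Toy model: a 128-bit register is a list of 16 lanes, lanes[i] == V.b[i].
# Memory bytes are labelled 0..15 by their offset from the load address.

def ld1_16b(mem, big_endian):
    """ld1 {Vn.16b},[addr]: byte-sized elements, so endianness has no effect."""
    return list(mem)

def ld1_2d(mem, big_endian):
    """ld1 {Vn.2d},[addr]: doubleword order is fixed; bytes inside each
    doubleword follow the memory endianness."""
    lo, hi = list(mem[:8]), list(mem[8:])
    if big_endian:
        lo, hi = lo[::-1], hi[::-1]
    return lo + hi

def rev64(v):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each doubleword."""
    return v[7::-1] + v[15:7:-1]

mem = list(range(16))

print("ld1 .16b, LE:", ld1_16b(mem, big_endian=False))
print("ld1 .16b, BE:", ld1_16b(mem, big_endian=True))
print("ld1 .2d,  BE:", ld1_2d(mem, big_endian=True))
```

Under that assumption "ld1 {H.16b},[x1]" yields the same lanes in both modes,
so whatever follows the load could be shared between LE and BE. The .2d form
on BE already gives the layout that LE only reaches after rev64.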

regards,
Mamone
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
