Re: [AArch64] Optimize GHASH

Michael Weiser Thu, 21 Jan 2021 15:45:27 -0800

Hello Mamone,

On Wed, Jan 20, 2021 at 10:25:19PM +0200, Maamoun TK wrote:


> I'm trying to install Gentoo on VMware by walking through this recipe
> https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scratch-on-an-encrypted-partition-9e4665f638e2
> I'm in the middle of the recipe now, but there are a lot of instructions
> there, so I'm gonna get the OS working in the end.

As far as I can tell that recipe only encompasses basic installation.
You'd additionally need to run crossdev to create a cross-toolchain and
then install qemu as well. Gentoo has a very steep learning curve. There's
no benefit compared to buildroot for our use-case here, IMO.

> Here is how I get the vector instructions to operate on registers in LE
> mode. I'll take this instruction as an example: pmull  v0.1q,v1.1d,v2.1d
> Input represented as indexes
> v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> the instruction byte-reverses each of the 64-bit parts of the register, so
> the instruction sees the registers as follows
> v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
> v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
> so what I did in LE mode is reverse the 64-bit parts using the rev64
> instruction before executing the doubleword operation; accordingly, the
> pmull output will be 128-bit byte-reversed
> Output represented as indexes
> v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

> What I'm assuming in BE mode is that operations are performed in the normal
> way on the register side, so there is no need to reverse the inputs, and the
> output is normal as well. Hence the macros "REDUCTION" and "PMUL_PARAM"
> differ in their structure; it's not a matter of the zip instruction
> performing better but of how to handle the weird situation in LE mode.

I've tried for a number of hours to make this work today. Whenever I
added correct handling of the transposed doublewords to one macro,
another broke down. To me the problem comes down to this: ldr
HQ,[TABLE...] and st1.16b are fighting each other and can't be brought
together without a lot of additional instructions (at least not by me).

Longer story: ldr does a 128bit load. The byte order it produces in the
register on BE is exactly the reverse of the one on LE. As you describe
above, the macros for LE expect a state which is neither of those: the
bytes transposed as if BE but the doublewords as loaded on LE. For BE this
poses the opposite problem: it natively loads bytes in the order LE has
to reproduce using rev64 but then needs to reproduce the doubleword
order of LE for the LE routines to work, or basically have native BE
routines.
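
The layouts described above can be sketched with a small Python model. This
is only an illustration, not part of the patch: a register is assumed to be
a list of 16 byte lanes (index i stands for lane b[i]), the values are
memory offsets, and the helper names are made up for this sketch.

```python
# Model of a 128-bit vector register as a list of 16 byte lanes.
# reg[i] stands for lane b[i]; the values are the memory offsets
# the bytes were loaded from. All helper names are hypothetical.

MEM = list(range(16))  # memory bytes m[0]..m[15], labelled by offset


def ldr_q(mem, big_endian):
    """ldr Qn,[addr]: one 128-bit load, so endianness decides lane order."""
    return list(reversed(mem)) if big_endian else list(mem)


def ld1_16b(mem):
    """ld1 {Vn.16b},[addr]: byte elements, same lane order on LE and BE."""
    return list(mem)


def rev64_16b(reg):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each doubleword."""
    return reg[7::-1] + reg[15:7:-1]


def ext8(reg):
    """ext Vd.16b,Vn.16b,Vn.16b,#8: swap the two doublewords."""
    return reg[8:] + reg[:8]


le = rev64_16b(ldr_q(MEM, big_endian=False))  # state the LE macros expect
be = ldr_q(MEM, big_endian=True)              # what BE gets natively

print("LE ldr + rev64 :", le)
print("BE ldr         :", be)
print("BE ldr + ext #8:", ext8(be))
```

In this model the LE state after rev64 and the BE state after an extra
ext #8 come out identical, while the plain BE load differs from it only in
the order of the two doublewords.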

That's what my last pedestrian change did. After today I'd perhaps write
it like this (untested):

@@ -125,10 +135,12 @@ IF_BE(`

 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    ext            H.16b,H.16b,H.16b,#8
 ')
+    dup            EMSB.16b,H.b[7]
     mov            x1,#0xC200000000000000
     mov            x2,#1
     mov            POLY.d[0],x1

When trying to cater to the current layout on LE, all the other vectors
which are later used in conjunction with H have to be reversed as well.
That leads to this diff to your initial patch:

@@ -125,14 +135,21 @@ IF_BE(`

 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
+    dup            EMSB.16b,H.b[0]
     rev64          H.16b,H.16b
+',`
+    dup            EMSB.16b,H.b[15]
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
+IF_LE(`
     mov            POLY.d[0],x1
     mov            POLY.d[1],x2
+',`
+    mov            POLY.d[1],x1
+    mov            POLY.d[0],x2
+')
     sshr           EMSB.16b,EMSB.16b,#7
     and            EMSB.16b,EMSB.16b,POLY.16b
     ushr           B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr            H.16b,H.16b,B.16b
     eor            H.16b,H.16b,EMSB.16b

+IF_LE(`
     dup            POLY.2d,POLY.d[0]
+',`
+    dup            POLY.2d,POLY.d[1]
+')

     C --- calculate H^2 = H*H ---

The difference in index in dup EMSB nicely shows the doubleword
transposition compared to LE. If on LE the dup was done after the rev64,
it'd be H.b[7] vs. H.b[15].
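
The index relationship can be checked with a minimal Python sketch, under
the same assumptions as any such model: a register is a list of 16 byte
lanes, values are memory offsets, and the helper names are hypothetical.

```python
# Which memory byte does each "dup EMSB.16b,H.b[N]" variant broadcast?
# reg[i] stands for lane b[i]; values are memory offsets of H.

def ldr_q(mem, big_endian):
    """ldr Qn,[addr]: 128-bit load, lane order depends on endianness."""
    return list(reversed(mem)) if big_endian else list(mem)


def rev64_16b(reg):
    """rev64 on .16b: reverse the bytes inside each doubleword."""
    return reg[7::-1] + reg[15:7:-1]


def ext8(reg):
    """ext ...,#8 with both sources equal: swap the doublewords."""
    return reg[8:] + reg[:8]


mem = list(range(16))
le_loaded = ldr_q(mem, big_endian=False)
be_loaded = ldr_q(mem, big_endian=True)

picks = {
    "LE, b[0] before rev64": le_loaded[0],
    "LE, b[7] after rev64": rev64_16b(le_loaded)[7],
    "BE, b[15] after ldr": be_loaded[15],
    "BE, b[7] after ext #8": ext8(be_loaded)[7],
}

for variant, offset in picks.items():
    print(f"{variant}: memory byte {offset}")
```

All four variants select the byte at memory offset 0 in this model, which
is why the lane index has to move together with the layout.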

With this layout PMUL_PARAM can work on H and POLY but then needs to use
pmull instead of pmull2 because the relevant data is in the other
doublewords compared to LE. On the other hand, since the output of
PMUL_PARAM is to be stored using st1.16b it must not have the
doublewords transposed ("load-inverted" I termed it in the comments ;).
That leads to the following adjustment and makes the first 16 bytes of
TABLE identical to LE:

@@ -109,11 +118,12 @@ define(`H4L', `v30')

 .macro PMUL_PARAM in, param1, param2
 IF_BE(`
-    pmull2         Hp.1q,\in\().2d,POLY.2d
+    pmull          Hp.1q,\in\().1d,POLY.1d
     ext            Hm.16b,\in\().16b,\in\().16b,#8
     eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1\().2d,\in\().2d,Hm.2d
-    zip2           \param2\().2d,\in\().2d,Hm.2d
+    C output must be in native register order (not load-inverted) for st1.16b to work
+    zip2           \param1\().2d,\in\().2d,Hm.2d
+    zip1           \param2\().2d,\in\().2d,Hm.2d
 ',`
     pmull2         Hp.1q,\in\().2d,POLY.2d
     eor            Hm.16b,\in\().16b,Hp.16b

In PMUL is where it breaks down, at least for my brain: Its first call
is handed H (which has doublewords "transposed" from the initial ldr) and
H1M and H1L (which must not have doublewords transposed so st1.16b
writes them to memory in correct order). It wants to pmull/pmull2 them
which requires corresponding doublewords at the same index. So we'd
need to temporarily transpose \in for that:

@@ -46,25 +46,34 @@ define(`R1', `v19')

 C common macros:
 .macro PMUL in, param1, param2
-    pmull          F.1q,\param2\().1d,\in\().1d
-    pmull2         F1.1q,\param2\().2d,\in\().2d
-    pmull          R.1q,\param1\().1d,\in\().1d
-    pmull2         R1.1q,\param1\().2d,\in\().2d
+    C PMUL_PARAM left us with \param1 and \param2 in native register order but
+    C \in is load-inverted from initial load of H using ldr, something must give
+IF_BE(`
+    ext            T.16b,\in\().16b,\in\().16b,#8
+',`
+    mov            T.16b,\in\().16b
+')
+    pmull          F.1q,\param2\().1d,T.1d
+    pmull2         F1.1q,\param2\().2d,T.2d
+    pmull          R.1q,\param1\().1d,T.1d
+    pmull2         R1.1q,\param1\().2d,T.2d
     eor            F.16b,F.16b,F1.16b
     eor            R.16b,R.16b,R1.16b
 .endm
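
Why the doublewords have to sit at the same index can be illustrated with
a rough Python model of pmull/pmull2. It assumes a register is a
(d0, d1) tuple of two 64-bit doublewords; the operand values are arbitrary
and the helper names are made up for this sketch.

```python
# pmull multiplies the lower doublewords of its operands, pmull2 the
# upper ones. If one operand has its doublewords swapped, the wrong
# halves get paired unless it is swapped back first.

def clmul64(a, b):
    """Carry-less (polynomial) multiply of two 64-bit values."""
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= a << bit
    return result


def pmull(x, y):
    """pmull Vd.1q,Vn.1d,Vm.1d: multiply the lower doublewords."""
    return clmul64(x[0], y[0])


def pmull2(x, y):
    """pmull2 Vd.1q,Vn.2d,Vm.2d: multiply the upper doublewords."""
    return clmul64(x[1], y[1])


def ext8(x):
    """ext ...,#8 with both sources equal: swap the doublewords."""
    return (x[1], x[0])


param = (0x3, 0x5)        # arbitrary values, native doubleword order
h_native = (0x7, 0xB)     # arbitrary values, native doubleword order
h_swapped = ext8(h_native)  # stands in for the "load-inverted" operand

mismatched = pmull(param, h_swapped)       # pairs param.d0 with h.d1
matched = pmull(param, ext8(h_swapped))    # swap back first, as T does

print(hex(mismatched), hex(matched))
```

The temporary T in the diff plays the role of `ext8(h_swapped)` here: it
undoes the swap only for the duration of the multiplications.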

If we finally artificially restore the doubleword transposition in
REDUCTION for H2 and H3 we're all set for the next calls:

 .macro REDUCTION out
 IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
     ext            \out\().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out\().16b,\out\().16b,R.16b
+    pmull2         T.1q,\out\().2d,POLY.2d
 ',`
     pmull          T.1q,F.1d,POLY.1d
+')
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out\().16b,F.16b,R.16b
+C artificially restore load inversion for PMUL_PARAM :-(
+IF_BE(`
+    ext            \out\().16b,\out\().16b,\out\().16b,#8
 ')
 .endm

So all we're doing is catering to the quirk of the very first ldr
operation. The easiest solution seems to me to align all types of load
and store operations with each other or counteract their quirks right
after or before executing them. That way we end up with identical
register contents on LE and BE and don't have to maintain separate
implementations.

That'd be in line with what we ended up with on arm32 NEON as well.
memxor3.asm does do the dance of working with different register content
but there it's only bitwise operations and the load and store operations
have identical behaviour.

The advantage of the current implementation with transposed doublewords
and only the LE routines seems to me to be that the overhead on LE and BE
would be about the same.

Do you think it makes sense to try and adjust the code to work with the
BE layout natively and have a full 128bit reverse after ldr-like loads
on LE instead (considering that 99.999% of aarch64 users will run LE)?

> > Otherwise, what's your error message from podman? It's got no daemon, so
> > it shouldn't need a socket to connect to it like docker does. Out to the
> > Internet for image download it's also a standard client and respects
> > environment variables for proxies as usual.
> >
> >
> I got Error: error creating network namespace for container. I think I can
> fix it by tracing the problem but let's try the other methods first as I
> think it's gonna be simpler for me..

I found this error on the Net in conjunction with a Debian/Ubuntu
security-related custom kernel knob for disabling unprivileged user
namespaces that was enabled by default once. I tested that with Ubuntu
18.04, 20.04 and 20.10 yesterday and it's disabled (i.e. namespaces for
unprivileged users enabled) on all of them. You can still have a look at
your setting in /proc/sys/kernel/unprivileged_userns_clone or with
sysctl kernel.unprivileged_userns_clone. It needs to be set to 1 for
rootless podman to work.
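
A small Python sketch for reading that setting, in case it helps. It
assumes only the /proc path named above; the function names are made up,
and a missing file is reported as such rather than interpreted, since the
knob is specific to kernels carrying that patch.

```python
from pathlib import Path

# Path taken from the text above; only present on kernels with the knob.
KNOB = Path("/proc/sys/kernel/unprivileged_userns_clone")


def userns_clone_setting(path=KNOB):
    """Return the knob's integer value, or None if the file is missing."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def describe(value):
    """Turn the raw setting into a short human-readable status."""
    if value is None:
        return "knob not present on this kernel"
    if value == 1:
        return "unprivileged user namespaces enabled"
    return "unprivileged user namespaces disabled"


if __name__ == "__main__":
    print(describe(userns_clone_setting()))
```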

You're not by any chance running the Windows Subsystem for Linux (WSL)?
https://github.com/containers/podman/issues/3288#issuecomment-501356136 :)

Or inside another container at a hosting service?
https://github.com/containers/podman/issues/4056

Otherwise I have no idea what could be causing that and have never seen
that error.
-- 
Thanks,
Michael
