On Sun, 2024-05-05 at 16:10 +0200, Niels Möller wrote:
> Eric Richter <eric...@linux.ibm.com> writes:
> 
> > This patch introduces an optimized powerpc64 assembly implementation
> > for sha256-compress-n. This takes advantage of the vshasigma
> > instruction, as well as unrolling loops to best take advantage of
> > running instructions in parallel.
> 
> Thanks. I'm now having a closer read of the assembly code. Comments
> below.
> 
> > +C ROUND(A B C D E F G H R EXT)
> > +define(`ROUND', `
> > +
> > +   vadduwm VT1, VK, IV($9)               C VT1: k+W
> > +   vadduwm VT4, $8, VT1                  C VT4: H+k+W
> > +
> > +   lxvw4x  VSR(VK), TK, K                C Load Key
> > +   addi    TK, TK, 4                     C Increment Pointer to next key
> > +
> > +   vadduwm VT2, $4, $8                   C VT2: H+D
> > +   vadduwm VT2, VT2, VT1                 C VT2: H+D+k+W
> > +
> > +   vshasigmaw      SIGE, $5, 1, 0b1111   C Sigma(E)  Se
> > +   vshasigmaw      SIGA, $1, 1, 0        C Sigma(A)  Sa
> > +
> > +   vxor    VT3, $2, $3                   C VT3: b^c
> > +   vsel    VT0, $7, $6, $5               C VT0: Ch.
> > +   vsel    VT3, $3, $1, VT3              C VT3: Maj(a,b,c)
> > +
> > +   vadduwm VT4, VT4, VT0                 C VT4: Hkw + Ch.
> > +   vadduwm VT3, VT3, VT4                 C VT3: HkW + Ch. + Maj.
> > +
> > +   vadduwm VT0, VT0, VT2                 C VT0: Ch. + DHKW
> > +   vadduwm $8, SIGE, SIGA                C Anext: Se + Sa
> > +   vadduwm $4, VT0, SIGE                 C Dnext: Ch. + DHKW + Se
> > +   vadduwm $8, $8, VT3                   C Anext: Se+Sa+HkW+Ch.+Maj.
> > +
> > +
> > +   C Schedule (data) for 16th round in future
> > +   C Extend W[i]
> > +   ifelse(`$10', `1', `
> > +           vshasigmaw      SIGE, IV($9 + 14), 0, 0b1111
> > +           vshasigmaw      SIGA, IV($9 + 1), 0, 0b0000
> > +           vadduwm         IV($9), IV($9), SIGE
> > +           vadduwm         IV($9), IV($9), SIGA
> > +           vadduwm         IV($9), IV($9), IV($9 + 9)
> > +   ')
> > +')
> 
> I think it would be a bit simpler to take out the extend logic to its
> own macro. 
> 
> > +define(`EXTENDROUND',      `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9, 1)')
> 
> If you do that, then you would define
>   
>   define(`EXTENDROUND',       `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')
> 
> (In other related code, input expansion is done at the beginning of a
> round iteration rather than at the end, but doing it at the end like
> you do may be better scheduling).
> 

Makes sense, I'll move that extend logic into its own macro.
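Something along these lines, perhaps (untested sketch, reusing the macro
and register names from the patch, and assuming the non-extending rounds
keep going through a plain NOEXTENDROUND wrapper):

   define(`EXTEND', `
           vshasigmaw      SIGE, IV($1 + 14), 0, 0b1111
           vshasigmaw      SIGA, IV($1 + 1), 0, 0b0000
           vadduwm         IV($1), IV($1), SIGE
           vadduwm         IV($1), IV($1), SIGA
           vadduwm         IV($1), IV($1), IV($1 + 9)
   ')
   define(`EXTENDROUND',   `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')
   define(`NOEXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9)')

with the `$10'/ifelse branch dropped from ROUND itself.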

You are correct: the expansion logic was moved to the end of the round
to improve scheduling on the CPU. The vshasigma instructions take more
cycles and are issued on a different unit than the other arithmetic
operations, so they can run in parallel with the beginning of the next
round; there are no register dependencies on them until the next
round's vshasigma instructions.

> > +define(`LOAD', `
> > +   IF_BE(`lxvw4x   VSR(IV($1)), 0, INPUT')
> > +   IF_LE(`
> > +           lxvd2x  VSR(IV($1)), 0, INPUT
> > +           vperm   IV($1), IV($1), IV($1), VT0
> > +   ')
> > +   addi    INPUT, INPUT, 4
> > +')
> > +
> > +define(`DOLOADS', `
> > +   IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
> 
> Could you have a dedicated register for the permutation constant, and
> load it only once at function entry? If you have general registers to
> spare, it could also make sense to use, e.g., three registers for the
> constant values 16, 32, 48, and use them for indexing. Then you don't
> need to update the INPUT pointer as often, and you can use the same
> constants for other load/store sequences as well.
> 

There are plenty of GPRs to spare; I will test and benchmark a few
options that use additional GPRs as index registers.
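For the input loads that could look roughly like this (untested sketch;
T2/T3/T4 stand in for whatever spare GPRs end up holding 16/32/48, and
the IF_LE lxvd2x/vperm path is elided):

   C once, at function entry
   li      T2, 16
   li      T3, 32
   li      T4, 48

   C then per group of four input vectors
   lxvw4x  VSR(IV(0)), 0,  INPUT
   lxvw4x  VSR(IV(1)), T2, INPUT
   lxvw4x  VSR(IV(2)), T3, INPUT
   lxvw4x  VSR(IV(3)), T4, INPUT
   addi    INPUT, INPUT, 64

The same three constants could then serve the state load/store and the
non-volatile register save/restore sequences as well.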

As for VRs, unfortunately the current implementation uses all 32 VRs:
 16 for W[i]
 8 for state
 7 for round arithmetic (two of these specifically for sigma, to avoid
   a dependency bubble)
 1 for storing the key constant K

That said, I'm going to experiment with some VSX instructions to see if
it is possible to spill over certain operations into VSRs, without
needing an explicit copy back from VSR to VR.

> > +   LOAD(0)
> > +   LOAD(1)
> > +   LOAD(2)
> > +   LOAD(3)
> 
> > +PROLOGUE(_nettle_sha256_compress_n)
> > +   cmpwi   0, NUMBLOCKS, 0
> > +   ble     0, .done
> > +   mtctr   NUMBLOCKS
> > +
> > +   C Store non-volatile registers
> > +   subi    SP, SP, 64+(12*16)
> > +   std     T0,     24(SP)
> > +   std     T1,     16(SP)
> > +   std     COUNT,  8(SP)
> 
> For save/restore of registers, I prefer to use the register names, not
> the defined symbols. And T0, T1, COUNT are defined to use r7, r8, r10,
> which *are* volatile, right?
> 

Ah yep, good catch!

> Does the data stored fit in the 288 byte "protected zone"? If so,
> probably best to not modify the stack pointer.
> 

At the moment it should, as I'm currently moving the stack pointer by
256 bytes. With the volatile GPR stores removed and the new index GPRs
added, the vector register saves should still fit within the 288-byte
zone. I will make this change in the next version if it fits.
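If it does fit, the saves could stay below the unmodified stack pointer,
something like (untested sketch; offsets -16 through -192 all lie within
the 288-byte protected zone):

   li      T0, -16
   stvx    v20, T0, SP
   li      T0, -32
   stvx    v21, T0, SP
   C ... and so on, down to
   li      T0, -192
   stvx    v31, T0, SP

with matching lvx restores before returning.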

> > +   li      T0, 32
> > +   stvx    v20, 0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v21, T0, SP
> 
> Here it would also help a bit to allocate constants 16, 32, 48 in
> registers.
> 
> > +   subi    T0, T0, 16
> > +   stvx    v22, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v23, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v24, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v25, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v26, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v27, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v28, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v29, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v30, T0, SP
> > +   subi    T0, T0, 16
> > +   stvx    v31, T0, SP
> > +
> > +   C Load state values
> > +   li      T0, 16
> > +   lxvw4x  VSR(VSA), 0, STATE      C VSA contains A,B,C,D
> > +   lxvw4x  VSR(VSE), T0, STATE     C VSE contains E,F,G,H
> > +
> > +.loop:
> > +   li      TK, 0
> > +   lxvw4x  VSR(VK), TK, K
> > +   addi    TK, TK, 4
> > +
> > +   DOLOADS
> > +
> > +   C "permute" state from VSA containing A,B,C,D into
> > VSA,VSB,VSC,VSD
> 
> Can you give a bit more detail on this permutation? Do the main round
> operations only use 32 bits each from the state registers? There's no
> reasonable way to use a more compact representation?
> 

Correct: the state registers VSA, VSB, VSC, etc. only use the first 32
bits of their respective vector register for the round arithmetic (64
bits in the logically identical sha512).

Unfortunately, compacting the state values A,B,C,D into a single VR
introduces performance problems for steps like:
   vsel   VT3, $3, $1, VT3  C VT3: Maj(a,b,c)
as it would require additional instructions to extract A, B, and C into
three separate VRs.

So this is largely a scalar implementation using vector instructions,
as there is no scalar equivalent to vshasigma.

That said, this "permutation" allows the loading of four state values
in a single load instruction rather than four separate loads, with the
tradeoff of needing more arithmetic instructions to permute them into
place. I think this should end up being faster (combined with parallel
scheduling of the non-dependent vsldoi shifts), but I will run some
tests to confirm this.

> > +   vsldoi  VSB, VSA, VSA, 4
> > +   vsldoi  VSF, VSE, VSE, 4
> > +
> > +   vsldoi  VSC, VSA, VSA, 8
> > +   vsldoi  VSG, VSE, VSE, 8
> > +
> > +   vsldoi  VSD, VSA, VSA, 12
> > +   vsldoi  VSH, VSE, VSE, 12
> > +
> > +   EXTENDROUNDS
> > +   EXTENDROUNDS
> > +   EXTENDROUNDS
> > +   NOEXTENDROUNDS
> > +
> > +   C Reload initial state from stack
> > +   li      T0, 16
> > +   lxvw4x  VSR(VT0), 0, STATE      C VSA contains A,B,C,D
> > +   lxvw4x  VSR(VT1), T0, STATE     C VSE contains E,F,G,H
> > +
> > +   C Repack VSA,VSB,VSC,VSD into VSA,VSE for storing
> > +   vmrghw  VSA, VSA, VSB
> > +   vmrghw  VSC, VSC, VSD
> > +   vmrghw  VSE, VSE, VSF
> > +   vmrghw  VSG, VSG, VSH
> > +
> > +   xxmrghd VSR(VSA), VSR(VSA), VSR(VSC)
> > +   xxmrghd VSR(VSE), VSR(VSE), VSR(VSG)
> > +
> > +   vadduwm VSA, VSA, VT0
> > +   vadduwm VSE, VSE, VT1
> 
> It seems unfortunate to have to do this conversion for each iteration
> of the loop; it would be nice if state could be converted to the most
> efficient form before entering the loop, and not converted back until
> after loop exit. But we probably don't have enough registers to keep
> the old state exploded into many registers. And load/store of exploded
> state doesn't seem that attractive either.
> 

The necessary load/store per loop iteration is probably the worst
offender for performance, as the merges can be scheduled in parallel.
This is where I'd like to see if we can make the best use of VSRs to
temporarily hold the old state, as a VR <-> VSR move *should* be faster
than a load/store to memory.
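As a rough illustration of what I have in mind (untested, and the exact
operand syntax for the low half of the VSR file depends on how
machine.m4 spells those registers): the entry state could be parked in
a pair of scratch VSRs at the top of the loop and pulled back with
xxlor, instead of reloading it from STATE each iteration:

   C top of .loop, before the vsldoi "permute"
   xxlor   vs0, VSR(VSA), VSR(VSA)       C stash A,B,C,D
   xxlor   vs1, VSR(VSE), VSR(VSE)       C stash E,F,G,H

   C after the repack, replacing the two lxvw4x from STATE
   xxlor   VSR(VT0), vs0, vs0
   xxlor   VSR(VT1), vs1, vs1
   vadduwm VSA, VSA, VT0
   vadduwm VSE, VSE, VT1

vs0/vs1 overlap the volatile f0/f1, so they would not need saving, and
if the running state stays in VSA/VSE across iterations the stxvw4x to
STATE could likewise move out of the loop.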

> > +   li      T0, 16
> > +   stxvw4x VSR(VSA), 0, STATE
> > +   stxvw4x VSR(VSE), T0, STATE
> > +
> > +   bdnz    .loop
> 
> Regards,
> /Niels Möller
> 

Thanks for the review!

- Eric