On Fri, 2024-06-07 at 14:08 +0200, Niels Möller wrote:
> Eric Richter <eric...@linux.ibm.com> writes:
> > +C ROUND(A B C D E F G H R EXT)
> > +define(`ROUND', `
> > +
> > + vadduwm VT1, VK, IV($9)       C VT1: k+W
> > + vadduwm VT4, $8, VT1          C VT4: H+k+W
> > +
> > + lxvw4x VSR(VK), TK, K         C Load Key
> > + addi TK, TK, 4                C Increment Pointer to next key
> > +
> > + vadduwm VT2, $4, $8           C VT2: H+D
> > + vadduwm VT2, VT2, VT1         C VT2: H+D+k+W
> 
> Could the above two instructions be changed to
> 
>  vadduwm VT2, VT4, $4    C Should be the same, (H+k+W) + D
> 
> (which would need one less register)? I realize there's a slight
> change in the dependency chain. Do you know how many cycles one of
> these rounds takes, and what's the bottleneck (I would guess either
> latency of the dependency chain between rounds, or throughput of one
> of the execution units, or instruction issue rate)?
> 

Theoretically it should be about 10 cycles per round, but the actual
measured performance doesn't quite hit that due to various quirks with
scheduling.

With this change, I'm getting about a +1 MB/s gain on the 256-byte
HMAC benchmark, but a slight loss of speed for the rest.
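
Concretely, the variant I tried replaces the two-instruction H+D path
with the single add you suggested; roughly (untested sketch, same
register names as the current ROUND macro):

    vadduwm VT1, VK, IV($9)       C VT1: k+W
    vadduwm VT4, $8, VT1          C VT4: H+k+W
    ...
    vadduwm VT2, VT4, $4          C VT2: (H+k+W)+D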

> > +define(`LOAD', `
> > + IF_BE(`lxvw4x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT')
> > + IF_LE(`
> > + lxvd2x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT
> > + vperm IV($1), IV($1), IV($1), VT0
> > + ')
> > +')
> > +
> > +define(`DOLOADS', `
> > + IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
> > + LOAD(0)
> > + LOAD(1)
> > + LOAD(2)
> > + LOAD(3)
> 
> If you pass the right TCx register as argument to the load macro, you
> don't need the m4 eval thing, which could make it a bit more
> readable, imo.
> 
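
As an illustration, the two-argument form could look something like
this (untested sketch, assuming the TC0/TC4/TC8/TC12 defines already
present in the file):

    C LOAD(INDEX, TCREG)
    define(`LOAD', `
        IF_BE(`lxvw4x VSR(IV($1)), $2, INPUT')
        IF_LE(`
        lxvd2x VSR(IV($1)), $2, INPUT
        vperm IV($1), IV($1), IV($1), VT0
        ')
    ')

with the call sites spelling out the register:

    LOAD(0, TC0)
    LOAD(1, TC4)
    LOAD(2, TC8)
    LOAD(3, TC12)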
> > + C Store non-volatile registers
> > +
> > + li T0, -8
> > + li T1, -24
> > + stvx v20, T0, SP
> > + stvx v21, T1, SP
> > + subi T0, T0, 32
> > + subi T1, T1, 32
> 
> This could probably be arranged with fewer instructions by having one
> register that is decremented as we move down in the guard area, and
> registers with constant values for indexing.
> 
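
Something like the following, presumably -- an untested sketch, with T2
standing in for whatever scratch register is free. It keeps the same
store addresses but needs only one subi per pair of stores:

    li T0, -8
    li T1, -24
    mr T2, SP
    stvx v20, T0, T2
    stvx v21, T1, T2
    subi T2, T2, 32
    stvx v22, T0, T2
    stvx v23, T1, T2
    subi T2, T2, 32
    C ... and so on down the guard area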
> > + C Reload initial state from VSX registers
> > + xxlor VSR(VT0), VSXA, VSXA
> > + xxlor VSR(VT1), VSXB, VSXB
> > + xxlor VSR(VT2), VSXC, VSXC
> > + xxlor VSR(VT3), VSXD, VSXD
> > + xxlor VSR(VT4), VSXE, VSXE
> > + xxlor VSR(SIGA), VSXF, VSXF
> > + xxlor VSR(SIGE), VSXG, VSXG
> > + xxlor VSR(VK), VSXH, VSXH
> > +
> > + vadduwm VSA, VSA, VT0
> > + vadduwm VSB, VSB, VT1
> > + vadduwm VSC, VSC, VT2
> > + vadduwm VSD, VSD, VT3
> > + vadduwm VSE, VSE, VT4
> > + vadduwm VSF, VSF, SIGA
> > + vadduwm VSG, VSG, SIGE
> > + vadduwm VSH, VSH, VK
> 
> It's a pity that there seem to be no useful xxadd* instructions. Do
> you need all eight temporary registers, or would you get the same
> speed doing just four at a time, i.e., 4 xxlor instructions, 4
> vadduwm, 4 xxlor, 4 vadduwm? There's no alias "xxmov" or the like
> that could be used instead of xxlor?
> 

Unfortunately most of the VSX instructions (particularly those in the
p8 ISA) are for floating-point operations, so using them in this way is
a bit of a hack. I'll test four at a time, but the performance will
likely be similar unless the xxlors are issued on a different execution
unit.
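
The four-at-a-time interleave would be roughly:

    xxlor VSR(VT0), VSXA, VSXA
    xxlor VSR(VT1), VSXB, VSXB
    xxlor VSR(VT2), VSXC, VSXC
    xxlor VSR(VT3), VSXD, VSXD
    vadduwm VSA, VSA, VT0
    vadduwm VSB, VSB, VT1
    vadduwm VSC, VSC, VT2
    vadduwm VSD, VSD, VT3
    xxlor VSR(VT0), VSXE, VSXE
    xxlor VSR(VT1), VSXF, VSXF
    xxlor VSR(VT2), VSXG, VSXG
    xxlor VSR(VT3), VSXH, VSXH
    vadduwm VSE, VSE, VT0
    vadduwm VSF, VSF, VT1
    vadduwm VSG, VSG, VT2
    vadduwm VSH, VSH, VT3

Same instruction count, just with only VT0-VT3 as temporaries; whether
it helps depends on whether the moves and adds can issue in parallel.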

I'm not aware of an xxmov/xxmr extended mnemonic, but this could always
be macroed instead for clarity.
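
For example (hypothetical macro name, expanding to the same xxlor):

    define(`XXMR', `xxlor VSR($1), $2, $2')

    XXMR(VT0, VSXA)    C same as xxlor VSR(VT0), VSXA, VSXA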

> Thanks for the update!
> /Niels
> 

Thanks for merging! I'll have a clean-up patch up soon, hopefully with
the SHA512 implementation as well.