On Fri, 2024-06-07 at 14:08 +0200, Niels Möller wrote:
> Eric Richter <eric...@linux.ibm.com> writes:
>
> > +C ROUND(A B C D E F G H R EXT)
> > +define(`ROUND', `
> > +
> > +    vadduwm VT1, VK, IV($9)    C VT1: k+W
> > +    vadduwm VT4, $8, VT1       C VT4: H+k+W
> > +
> > +    lxvw4x VSR(VK), TK, K      C Load Key
> > +    addi TK, TK, 4             C Increment Pointer to next key
> > +
> > +    vadduwm VT2, $4, $8        C VT2: H+D
> > +    vadduwm VT2, VT2, VT1      C VT2: H+D+k+W
>
> Could the above two instructions be changed to
>
>   vadduwm VT2, VT4, $4   C Should be the same, (H+k+W) + D
>
> (which would need one less register)? I realize there's a slight
> change in the dependency chain. Do you know how many cycles one of
> these rounds takes, and what's the bottleneck? (I would guess either
> latency of the dependency chain between rounds, throughput of one of
> the execution units, or instruction issue rate.)
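Yes, the two orderings compute the same values: vadduwm adds each
32-bit lane independently mod 2^32, so the sum can be reassociated
freely. A minimal scalar C model of the two variants (illustrative
names, not the actual vector registers):

  #include <assert.h>
  #include <stdint.h>

  /* Scalar model of the two vadduwm orderings. Each vector lane is an
     independent uint32_t addition (mod 2^32), so reassociating the sum
     is safe; h, d, k, w mirror the round variables. */
  static uint32_t
  t2_two_adds (uint32_t h, uint32_t d, uint32_t k, uint32_t w)
  {
    uint32_t vt1 = k + w;            /* VT1: k+W */
    uint32_t vt2 = h + d;            /* VT2: H+D */
    return vt2 + vt1;                /* VT2: H+D+k+W */
  }

  static uint32_t
  t2_reuse_vt4 (uint32_t h, uint32_t d, uint32_t k, uint32_t w)
  {
    uint32_t vt4 = h + (k + w);      /* VT4: H+k+W, already computed */
    return vt4 + d;                  /* (H+k+W) + D, one register less */
  }

  int
  main (void)
  {
    assert (t2_two_adds (0xdeadbeef, 0x01234567, 0x428a2f98, 0x9b05688c)
            == t2_reuse_vt4 (0xdeadbeef, 0x01234567, 0x428a2f98, 0x9b05688c));
    return 0;
  }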
Theoretically it should be about 10 cycles per round, but the actual
measured performance doesn't quite hit that, due to various scheduling
quirks. With this change, I'm getting about a +1 MB/s gain on the
256-byte HMAC benchmark, but a slight loss of speed for the rest.

> > +define(`LOAD', `
> > +    IF_BE(`lxvw4x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT')
> > +    IF_LE(`
> > +        lxvd2x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT
> > +        vperm IV($1), IV($1), IV($1), VT0
> > +    ')
> > +')
> > +
> > +define(`DOLOADS', `
> > +    IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
> > +    LOAD(0)
> > +    LOAD(1)
> > +    LOAD(2)
> > +    LOAD(3)
>
> If you pass the right TCx register as argument to the load macro, you
> don't need the m4 eval thing, which could make it a bit more
> readable, imo.

> > +    C Store non-volatile registers
> > +
> > +    li T0, -8
> > +    li T1, -24
> > +    stvx v20, T0, SP
> > +    stvx v21, T1, SP
> > +    subi T0, T0, 32
> > +    subi T1, T1, 32
>
> This could probably be arranged with fewer instructions by having one
> register that is decremented as we move down in the guard area, and
> registers with constant values for indexing.

> > +    C Reload initial state from VSX registers
> > +    xxlor VSR(VT0), VSXA, VSXA
> > +    xxlor VSR(VT1), VSXB, VSXB
> > +    xxlor VSR(VT2), VSXC, VSXC
> > +    xxlor VSR(VT3), VSXD, VSXD
> > +    xxlor VSR(VT4), VSXE, VSXE
> > +    xxlor VSR(SIGA), VSXF, VSXF
> > +    xxlor VSR(SIGE), VSXG, VSXG
> > +    xxlor VSR(VK), VSXH, VSXH
> > +
> > +    vadduwm VSA, VSA, VT0
> > +    vadduwm VSB, VSB, VT1
> > +    vadduwm VSC, VSC, VT2
> > +    vadduwm VSD, VSD, VT3
> > +    vadduwm VSE, VSE, VT4
> > +    vadduwm VSF, VSF, SIGA
> > +    vadduwm VSG, VSG, SIGE
> > +    vadduwm VSH, VSH, VK
>
> It's a pity that there seems to be no useful xxadd* instruction. Do
> you need all eight temporary registers, or would you get the same
> speed doing just four at a time, i.e., 4 xxlor instructions, 4
> vadduwm, 4 xxlor, 4 vadduwm? Is there no alias "xxmov" or the like
> that could be used instead of xxlor?

Unfortunately, most of the VSX instructions (particularly those in the
p8 ISA) are for floating-point operations; using them this way is a
bit of a hack. I'll test four at a time, but it will likely perform
about the same unless the xxlor's are issued on a different unit. I'm
not aware of an xxmov/xxmr extended mnemonic, but this could always be
macroed instead for clarity.

> Thanks for the update!
>
> /Niels

Thanks for merging! I'll have a clean-up patch up soon, hopefully with
the SHA512 implementation as well.
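P.S. For reference, the xxlor/vadduwm block quoted above is the usual
SHA-256 feed-forward: the xxlor's only move the saved copies back into
vector registers, and the vadduwm's add them into the working state.
In plain C it boils down to something like this (a sketch with
illustrative names, not the actual registers):

  #include <stdint.h>

  /* The initial state saved before the 64 rounds (held in VSX
     registers in the asm) is added back into the working variables,
     one 32-bit word per variable, mod 2^32. */
  void
  feed_forward (uint32_t state[8], const uint32_t saved[8])
  {
    for (int i = 0; i < 8; i++)
      state[i] += saved[i];    /* VSA..VSH += copies from VSXA..VSXH */
  }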