On Wed, Nov 22, 2023 at 10:37 AM Danny Tsen <dt...@us.ibm.com> wrote:
>
> > On Nov 22, 2023, at 2:27 AM, Niels Möller <ni...@lysator.liu.se> wrote:
> >
> > Danny Tsen <dt...@us.ibm.com> writes:
> >
> >> Interleaving at the instruction level may be a good option, but due
> >> to the PPC instruction pipeline this may need sufficient
> >> registers/vectors. Using the same vectors to change contents in
> >> successive instructions may require more cycles. In that case, more
> >> vectors/scalars will get involved and all vector assignments may
> >> have to change. That’s the reason I avoided it in this case.
> >
> > To investigate the potential, I would suggest some experiments with
> > software pipelining.
> >
> > Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling
> > the round loop. I think that should be 44 instructions of aes
> > mangling, plus instructions to set up the counter input, and do the
> > final xor and endianness things with the message. Arrange so that it
> > loads the AES state in a set of registers we can call A, operating
> > in-place on these registers. But at the end, arrange the XORing so
> > that the final cryptotext is located in a different set of
> > registers, B.
> >
> > Then, write the instructions to do ghash using the B registers as
> > input; I think that should be about 20-25 instructions. Interleave
> > those as well as possible with the AES instructions (say, two aes
> > instructions, one ghash instruction, etc).
> >
> > Software pipelining means that each iteration of the loop does
> > aes-ctr on four blocks, + ghash on the output for the four *previous*
> > blocks (so one needs extra code outside of the loop to deal with the
> > first and last 4 blocks). Decrypt processing should be simpler.
> >
> > Then you can benchmark that loop in isolation. It doesn't need to be
> > the complete function, the handling of first and last blocks can be
> > omitted, and it doesn't even have to be completely correct, as long
> > as it's the right instruction mix and the right data dependencies.
> > The benchmark should give a good idea of the potential speedup, if
> > any, from instruction-level interleaving.
>
> This is a very ideal condition. Too much interleaving may not produce
> the best results, and different architectures may have different
> results. I tried various ways when I implemented the AES/GCM stitching
> functions for OpenSSL. I’ll give it a try since your ghash function is
> different.
>
> > I would hope 4-way is doable with available vector registers (and
> > this inner loop should be less than 100 instructions, so not too
> > unmanageable). Going up to 8-way (like the current AES code) would
> > also be interesting, but as you say, you might have a shortage of
> > registers. If you have to copy state between registers and memory in
> > each iteration of an 8-way loop (which it looks like you also have to
> > do in your current patch), that overhead cost may outweigh the gains
> > you have from more independence in the AES rounds.
>
> 4x unrolling may not produce the best performance. I did that when I
> implemented this stitching function in OpenSSL, and it’s in one
> assembly file with no function calls outside the function. Once again,
> calling a function within a loop introduces a lot of overhead. Here
> are my past results for your reference. The first one is the original
> performance from OpenSSL. The second one was the 4x unrolling and the
> third one was the 8x. But I can try again.
>
> (This was run on a 3.5 GHz p10 machine.)
>
> AES-128-GCM  382128.50k 1023073.64k 2621489.41k 3604979.37k 4018642.94k 4032080.55k
> AES-128-GCM  347370.13k 1236054.06k 2778748.59k 3900567.21k 4527158.61k 4579759.45k  (4x AES and 4x ghash)
> AES-128-GCM  356520.19k  989983.06k 2902907.56k 4379016.19k 5180981.25k 5249717.59k  (8x AES and 2 4x ghash combined)

Calls impose a lot of overhead on Power, and both the efficient loop
instruction and the preferred indirect call instruction use the CTR
register.

Thanks, David

>
> Thanks.
> -Danny
>
> > Regards,
> > /Niels
> >
> > --
> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> > Internet email is subject to wholesale government surveillance.
>
> _______________________________________________
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
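
For readers following the thread, the loop shape Niels describes (encrypt
four blocks per iteration while hashing the four blocks produced by the
previous iteration, with a prologue and epilogue outside the loop) can be
sketched in C roughly as below. This is only an illustration of the
control flow and data dependencies, not code from nettle or OpenSSL:
toy_ctr4 and toy_hash4 are hypothetical stand-ins that are NOT AES or
GHASH, and the arrays a and b merely play the role of the register sets
A and B.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP 4 /* blocks per iteration, as in the 4-way suggestion */

/* Hypothetical stand-in for CTR encryption of 4 blocks. NOT AES. */
static void toy_ctr4(uint64_t *ctr, const uint64_t *in, uint64_t *out)
{
    for (int i = 0; i < GROUP; i++) {
        uint64_t ks = (*ctr)++ * 0x9E3779B97F4A7C15ull; /* fake keystream */
        out[i] = in[i] ^ ks;
    }
}

/* Hypothetical stand-in for hashing 4 blocks into acc. NOT GHASH. */
static uint64_t toy_hash4(uint64_t acc, const uint64_t *blocks)
{
    for (int i = 0; i < GROUP; i++)
        acc = (acc ^ blocks[i]) * 0x100000001B3ull;
    return acc;
}

/* Reference version: each iteration encrypts a group, then hashes it. */
uint64_t encrypt_hash_plain(uint64_t ctr, const uint64_t *in,
                            uint64_t *out, size_t groups)
{
    uint64_t acc = 0;
    for (size_t g = 0; g < groups; g++) {
        toy_ctr4(&ctr, in + g * GROUP, out + g * GROUP);
        acc = toy_hash4(acc, out + g * GROUP);
    }
    return acc;
}

/* Software-pipelined version: iteration g encrypts group g into "A"
 * while hashing group g-1, which is held in "B". */
uint64_t encrypt_hash_pipelined(uint64_t ctr, const uint64_t *in,
                                uint64_t *out, size_t groups)
{
    uint64_t acc = 0;
    uint64_t a[GROUP], b[GROUP];

    assert(in != NULL && out != NULL);
    if (groups == 0)
        return acc;

    /* Prologue: encrypt the first group; nothing to hash yet. */
    toy_ctr4(&ctr, in, b);
    memcpy(out, b, sizeof b);

    for (size_t g = 1; g < groups; g++) {
        /* These two steps have no data dependency on each other, so in
         * assembly their instructions could be interleaved. */
        toy_ctr4(&ctr, in + g * GROUP, a);  /* encrypt current group   */
        acc = toy_hash4(acc, b);            /* hash the previous group */

        memcpy(out + g * GROUP, a, sizeof a);
        memcpy(b, a, sizeof b); /* new ciphertext lands in "B" */
    }

    /* Epilogue: hash the last group. */
    return toy_hash4(acc, b);
}
```

Both functions should produce the same ciphertext and the same
accumulator for any number of groups; the pipelined one only reorders
the work so that the hash in each iteration is independent of the
encryption in that iteration.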