> On Nov 22, 2023, at 2:27 AM, Niels Möller <ni...@lysator.liu.se> wrote:
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> Interleaving at the instructions level may be a good option but due to
>> PPC instruction pipeline this may need to have sufficient
>> registers/vectors. Use same vectors to change contents in successive
>> instructions may require more cycles. In that case, more
>> vectors/scalar will get involved and all vectors assignment may have
>> to change. That’s the reason I avoided in this case.
>
> To investigate the potential, I would suggest some experiments with
> software pipelining.
>
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
>
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
>
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
>
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies.
> The benchmark should give a good idea for the potential speedup, if any,
> from instruction-level interleaving.

That is an idealized condition. Too much interleaving may not produce the best results, and different architectures may behave differently. I tried various approaches when I implemented the AES/GCM stitching functions for OpenSSL. I’ll give it a try, since your ghash function is different.
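
For reference, the loop structure described in the quoted text (encrypt four blocks per iteration, hash the four blocks from the previous iteration, with extra code before and after the loop) can be sketched as below. This is only an illustration of the control flow and data dependencies, not the PPC assembly and not the Nettle or OpenSSL code. `toy_ctr_encrypt` and `toy_hash_update` are hypothetical stand-ins, not real AES-CTR or GHASH, and the names `prev` and `cur` loosely play the roles of the B and A register sets.

```python
BATCH = 4  # blocks per iteration, as in the 4-way proposal

def toy_ctr_encrypt(counter, block):
    # Hypothetical stand-in for AES-CTR: the keystream depends only on
    # the counter, so blocks can be processed independently.
    keystream = (counter * 0x9E3779B1 + 0x7F4A7C15) & 0xFFFFFFFF
    return block ^ keystream

def toy_hash_update(state, block):
    # Hypothetical stand-in for GHASH: each step depends on the previous state.
    return ((state ^ block) * 0x01000193) & 0xFFFFFFFF

def encrypt_batch(counter, plain):
    return [toy_ctr_encrypt(counter + i, p) for i, p in enumerate(plain)]

def hash_batch(state, cipher):
    for c in cipher:
        state = toy_hash_update(state, c)
    return state

def straightforward(blocks):
    # Each iteration encrypts a batch, then hashes that same batch.
    state, out = 0, []
    for i in range(0, len(blocks), BATCH):
        cipher = encrypt_batch(i, blocks[i:i + BATCH])
        state = hash_batch(state, cipher)
        out.extend(cipher)
    return out, state

def software_pipelined(blocks):
    # Assumes a non-empty input that is a whole number of batches.
    assert len(blocks) >= BATCH and len(blocks) % BATCH == 0
    state, out = 0, []
    # Prologue: encrypt the first batch; there is nothing to hash yet.
    prev = encrypt_batch(0, blocks[:BATCH])
    out.extend(prev)
    # Steady state: encrypt batch k and hash batch k-1. The two pieces of
    # work do not depend on each other, so they could be interleaved.
    for i in range(BATCH, len(blocks), BATCH):
        cur = encrypt_batch(i, blocks[i:i + BATCH])
        state = hash_batch(state, prev)
        out.extend(cur)
        prev = cur
    # Epilogue: hash the last encrypted batch.
    state = hash_batch(state, prev)
    return out, state

blocks = list(range(100, 116))  # 16 toy "blocks"
assert straightforward(blocks) == software_pipelined(blocks)
```

Both functions produce the same ciphertext and the same final hash state. The only difference is that, in the pipelined loop body, the encrypt work and the hash work belong to different batches and so have no dependency on each other.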
>
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.

4x unrolling may not produce the best performance. I did that when I implemented this stitching function in OpenSSL; it is in a single assembly file, with no function calls outside the function. Once again, calling a function within a loop introduces a lot of overhead. Here are my past results for your reference. The first is the original performance from OpenSSL, the second is the 4x unrolling, and the third is the 8x. But I can try again.

(This was run on a P10 machine at 3.5 GHz.)

AES-128-GCM   382128.50k  1023073.64k  2621489.41k  3604979.37k  4018642.94k  4032080.55k
AES-128-GCM   347370.13k  1236054.06k  2778748.59k  3900567.21k  4527158.61k  4579759.45k  (4x AES and 4x ghash)
AES-128-GCM   356520.19k   989983.06k  2902907.56k  4379016.19k  5180981.25k  5249717.59k  (8x AES and 2 4x ghash combined)

Thanks.

-Danny

>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se