Danny Tsen <dt...@us.ibm.com> writes:

> Interleaving at the instructions level may be a good option but due to
> PPC instruction pipeline this may need to have sufficient
> registers/vectors. Use same vectors to change contents in successive
> instructions may require more cycles. In that case, more
> vectors/scalar will get involved and all vectors assignment may have
> to change. That’s the reason I avoided in this case.
To investigate the potential, I would suggest some experiments with software pipelining.

Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the round loop. I think that should be 44 instructions of aes mangling, plus instructions to set up the counter input, and do the final xor and endianness things with the message. Arrange so that it loads the AES state in a set of registers we can call A, operating in-place on these registers. But at the end, arrange the XORing so that the final cryptotext is located in a different set of registers, B.

Then, write the instructions to do ghash using the B registers as input; I think that should be about 20-25 instructions. Interleave those as well as possible with the AES instructions (say, two aes instructions, one ghash instruction, etc).

Software pipelining means that each iteration of the loop does aes-ctr on four blocks, + ghash on the output for the four *previous* blocks (so one needs extra code outside of the loop to deal with the first and last 4 blocks). Decrypt processing should be simpler, since ghash then operates on the input cryptotext, which doesn't depend on the AES output.

Then you can benchmark that loop in isolation. It doesn't need to be the complete function, the handling of first and last blocks can be omitted, and it doesn't even have to be completely correct, as long as it's the right instruction mix and the right data dependencies. The benchmark should give a good idea of the potential speedup, if any, from instruction-level interleaving.

I would hope 4-way is doable with the available vector registers (and this inner loop should be less than 100 instructions, so not too unmanageable). Going up to 8-way (like the current AES code) would also be interesting, but as you say, you might have a shortage of registers. If you have to copy state between registers and memory in each iteration of an 8-way loop (which it looks like you also have to do in your current patch), that overhead cost may outweigh the gains you have from more independence in the AES rounds.
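The loop structure described above can be sketched in portable C to make the data dependencies explicit. This is only an illustration of the pipelining shape, not the proposed assembly: toy_keystream and toy_hash_update are hypothetical placeholders standing in for the AES rounds and the ghash update (they are neither AES nor a GF(2^128) multiply), and the arrays a and b stand in for the A and B register sets. The self-check compares the pipelined loop against a straightforward block-by-block loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAY 4         /* blocks per loop iteration */
#define MAX_GROUPS 8  /* bound used only by the self-check below */

/* Hypothetical stand-in for the AES rounds on one counter block. NOT AES. */
static uint64_t toy_keystream(uint64_t key, uint64_t ctr)
{
  uint64_t x = (ctr ^ key) * 0x9E3779B97F4A7C15ULL;
  return x ^ (x >> 29);
}

/* Hypothetical stand-in for one ghash update. NOT a GF(2^128) multiply. */
static uint64_t toy_hash_update(uint64_t h, uint64_t state, uint64_t c)
{
  return (state ^ c) * (h | 1);
}

/* Straightforward version: encrypt a block, then hash it right away. */
static uint64_t reference(uint64_t key, uint64_t h, uint64_t ctr,
                          const uint64_t *msg, uint64_t *out, size_t ngroups)
{
  uint64_t state = 0;
  for (size_t i = 0; i < ngroups * WAY; i++) {
    out[i] = toy_keystream(key, ctr + i) ^ msg[i];
    state = toy_hash_update(h, state, out[i]);
  }
  return state;
}

/* Software-pipelined version: iteration g encrypts group g into a[],
   while hashing the ciphertext of group g-1, which is still held in b[]. */
static uint64_t pipelined(uint64_t key, uint64_t h, uint64_t ctr,
                          const uint64_t *msg, uint64_t *out, size_t ngroups)
{
  uint64_t a[WAY], b[WAY], state = 0;
  size_t g, j;

  if (ngroups == 0)
    return state;

  /* Prologue: encrypt the first group; there is nothing to hash yet. */
  for (j = 0; j < WAY; j++) {
    a[j] = toy_keystream(key, ctr + j);
    b[j] = a[j] ^ msg[j];
    out[j] = b[j];
  }

  for (g = 1; g < ngroups; g++) {
    /* Interleaved part: the two operations in this body are independent. */
    for (j = 0; j < WAY; j++) {
      a[j] = toy_keystream(key, ctr + g * WAY + j);  /* "AES" on group g */
      state = toy_hash_update(h, state, b[j]);       /* "ghash" on group g-1 */
    }
    /* Final xor moves the result from the A set to the B set. */
    for (j = 0; j < WAY; j++) {
      b[j] = a[j] ^ msg[g * WAY + j];
      out[g * WAY + j] = b[j];
    }
  }

  /* Epilogue: hash the last group, which the loop only encrypted. */
  for (j = 0; j < WAY; j++)
    state = toy_hash_update(h, state, b[j]);
  return state;
}

/* Returns 1 if both versions give the same ciphertext and hash state. */
int pipeline_matches_reference(size_t ngroups)
{
  uint64_t msg[MAX_GROUPS * WAY], out1[MAX_GROUPS * WAY], out2[MAX_GROUPS * WAY];
  uint64_t s1, s2;
  size_t i;

  assert(ngroups <= MAX_GROUPS);
  for (i = 0; i < ngroups * WAY; i++)
    msg[i] = 0x0123456789ABCDEFULL * (i + 1);

  s1 = reference(0x1111, 0x2222, 1, msg, out1, ngroups);
  s2 = pipelined(0x1111, 0x2222, 1, msg, out2, ngroups);
  return s1 == s2 && memcmp(out1, out2, ngroups * WAY * sizeof(uint64_t)) == 0;
}
```

Calling pipeline_matches_reference with any group count up to MAX_GROUPS should return 1. In real assembly, the interleaved part would be the fully unrolled aes and ghash instruction sequences mixed by hand, and the prologue and epilogue correspond to the extra code outside the loop for the first and last 4 blocks.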
Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.