Danny Tsen <dt...@us.ibm.com> writes:

> Interleaving at the instructions level may be a good option but due to
> PPC instruction pipeline this may need to have sufficient
> registers/vectors. Use same vectors to change contents in successive
> instructions may require more cycles. In that case, more
> vectors/scalar will get involved and all vectors assignment may have
> to change. That’s the reason I avoided in this case.

To investigate the potential, I would suggest some experiments with
software pipelining.

Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
round loop. I think that should be 44 instructions of aes mangling (11
per block: the initial key xor plus 10 rounds), plus instructions to set
up the counter input, and do the final xor and endianness things with
the message. Arrange so that it loads the AES state in a set of
registers we can call A, operating in-place on these registers. But at
the end, arrange the XORing so that the final ciphertext is located in a
different set of registers, B.

Then, write the instructions to do ghash using the B registers as input;
I think that should be about 20-25 instructions. Interleave those as
well as possible with the AES instructions (say, two aes instructions,
one ghash instruction, etc).

Software pipelining means that each iteration of the loop does aes-ctr
on four blocks, + ghash on the output for the four *previous* blocks (so
one needs extra code outside of the loop to deal with first and last 4
blocks). Decrypt processing should be simpler, since there the ghash
input is the incoming ciphertext, which is available before any AES
work is done.
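
To make the loop structure concrete, here is a minimal sketch in C of
what I mean by software pipelining. It is only an illustration of the
control flow and data dependencies: toy_keystream and toy_hash_update
are made-up stand-ins, not real AES or ghash, and the arrays a and b
just play the role of the A and B register sets. The real thing would
of course be PPC assembly with the two instruction streams interleaved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAY 4  /* blocks per loop iteration */

/* Toy stand-in for the AES-CTR keystream of one block. NOT real AES. */
static uint64_t toy_keystream(uint64_t key, uint64_t ctr)
{
    uint64_t x = (ctr ^ key) * 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 29);
}

/* Toy stand-in for one ghash update. NOT a real GF(2^128) multiply. */
static uint64_t toy_hash_update(uint64_t h, uint64_t hkey, uint64_t block)
{
    return (h ^ block) * (hkey | 1);
}

/* Straightforward reference: encrypt a block, then hash it at once. */
uint64_t encrypt_hash_simple(uint64_t key, uint64_t hkey, uint64_t ctr,
                             const uint64_t *src, uint64_t *dst, size_t n)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] ^ toy_keystream(key, ctr + i);
        h = toy_hash_update(h, hkey, dst[i]);
    }
    return h;
}

/* Software-pipelined variant: each iteration encrypts WAY new blocks
   and hashes the WAY blocks produced by the previous iteration. */
uint64_t encrypt_hash_pipelined(uint64_t key, uint64_t hkey, uint64_t ctr,
                                const uint64_t *src, uint64_t *dst, size_t n)
{
    uint64_t a[WAY], b[WAY];  /* play the role of register sets A and B */
    uint64_t h = 0;
    size_t i, j;

    assert(n >= WAY && n % WAY == 0);

    /* Prologue: encrypt the first group; there is nothing to hash yet. */
    for (j = 0; j < WAY; j++)
        dst[j] = b[j] = src[j] ^ toy_keystream(key, ctr + j);

    /* Steady state: the keystream work on A is independent of the hash
       work on B, so in assembly the two could be interleaved freely. */
    for (i = WAY; i < n; i += WAY) {
        for (j = 0; j < WAY; j++)
            a[j] = toy_keystream(key, ctr + i + j);
        for (j = 0; j < WAY; j++)
            h = toy_hash_update(h, hkey, b[j]);   /* previous group */
        for (j = 0; j < WAY; j++)
            dst[i + j] = b[j] = src[i + j] ^ a[j]; /* final xor into B */
    }

    /* Epilogue: hash the last group of ciphertext blocks. */
    for (j = 0; j < WAY; j++)
        h = toy_hash_update(h, hkey, b[j]);
    return h;
}
```

Both functions should give the same output and the same hash; the only
difference is that in the pipelined loop the hash always lags one group
behind, which is what removes the dependency between the two streams
within an iteration.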

Then you can benchmark that loop in isolation. It doesn't need to be the
complete function; the handling of first and last blocks can be omitted,
and it doesn't even have to be completely correct, as long as it's the
right instruction mix and the right data dependencies. The benchmark
should give a good idea of the potential speedup, if any, from
instruction-level interleaving.
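
A rough idea of what such an isolated benchmark could look like, as a
C sketch. Everything here is hypothetical scaffolding: dummy_kernel is
just a placeholder with a vaguely similar shape (8 words of state, 11
dependent steps), and in a real experiment the function pointer would
refer to the assembly loop under test. It uses the standard clock()
call, so it measures CPU time with fairly coarse resolution; a large
number of calls is needed for a meaningful figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* One call corresponds to one iteration of the inner loop under test. */
typedef void (*loop_body_fn)(uint64_t state[8]);

/* Hypothetical placeholder kernel: NOT AES or ghash, it only provides
   some dependent arithmetic so there is something to time. */
void dummy_kernel(uint64_t state[8])
{
    for (int r = 0; r < 11; r++)
        for (int j = 0; j < 8; j++)
            state[j] = (state[j] ^ state[(j + 1) & 7])
                       * 0x9E3779B97F4A7C15ULL + (uint64_t) r;
}

/* Run body `calls` times and return the average CPU seconds per call.
   The final state is folded into *checksum so that the compiler cannot
   discard the work as dead code. */
double bench_seconds_per_call(loop_body_fn body, size_t calls,
                              uint64_t *checksum)
{
    uint64_t state[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint64_t sum = 0;
    clock_t start, end;

    assert(calls > 0);

    start = clock();
    for (size_t i = 0; i < calls; i++)
        body(state);
    end = clock();

    for (int j = 0; j < 8; j++)
        sum ^= state[j];
    *checksum = sum;

    return (double) (end - start) / CLOCKS_PER_SEC / (double) calls;
}
```

Comparing the seconds per call for an interleaved and a non-interleaved
version of the loop, divided by the number of blocks per iteration,
would give the per-block cost for each variant.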

I would hope 4-way is doable with available vector registers (and this
inner loop should be less than 100 instructions, so not too
unmanageable). Going up to 8-way (like the current AES code) would also
be interesting, but as you say, you might have a shortage of registers.
If you have to copy state between registers and memory in each iteration
of an 8-way loop (which it looks like you also have to do in your
current patch), that overhead cost may outweigh the gains from more
independence in the AES rounds.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.