Re: ppc64: AES/GCM Performance improvement with stitched implementation

Danny Tsen Wed, 22 Nov 2023 07:37:50 -0800


> On Nov 22, 2023, at 2:27 AM, Niels Möller <ni...@lysator.liu.se> wrote:
> 
> Danny Tsen <dt...@us.ibm.com> writes:
> 
>> Interleaving at the instruction level may be a good option, but because
>> of the PPC instruction pipeline it may need sufficient
>> registers/vectors. Using the same vectors to change contents in
>> successive instructions may require more cycles. In that case, more
>> vectors/scalars would be involved and all vector assignments might have
>> to change. That's the reason I avoided it in this case.
> 
> To investigate the potential, I would suggest some experiments with
> software pipelining.
> 
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
> 
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
> 
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
> 
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
That is an idealized scenario.  Too much interleaving may not produce the best
results, and results may differ between architectures.  I tried various
approaches when I implemented the AES/GCM stitching functions for OpenSSL.
I'll give it a try, since your ghash function is different.
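
As a rough illustration of the software-pipelined loop structure described in the quoted text above, here is a minimal Python sketch. It is a toy model only: `toy_encrypt` and `toy_hash` are hypothetical stand-ins (not real AES-CTR or GHASH), and the names `reference` and `pipelined` are made up for this example. The point is the schedule: each loop iteration encrypts group n into registers "A" while hashing the previous group held in "B", with a prologue and epilogue for the first and last groups.

```python
MASK = 0xFFFFFFFF

def toy_encrypt(counter, group):
    """Hypothetical stand-in for 4-way AES-CTR (NOT real AES)."""
    return [(p ^ ((counter + i) * 0x9E3779B1)) & MASK
            for i, p in enumerate(group)]

def toy_hash(state, group):
    """Hypothetical stand-in for GHASH: fold ciphertext into a running state."""
    for c in group:
        state = ((state ^ c) * 0x01000193) & MASK
    return state

def reference(groups):
    """Straightforward schedule: encrypt a group, then hash it right away."""
    state, out = 0, []
    for n, g in enumerate(groups):
        c = toy_encrypt(4 * n, g)
        state = toy_hash(state, c)
        out.extend(c)
    return out, state

def pipelined(groups):
    """Software-pipelined schedule: hash group n-1 while encrypting group n."""
    if not groups:
        return [], 0
    state, out = 0, []
    b = toy_encrypt(0, groups[0])          # prologue: first group, no hash yet
    for n in range(1, len(groups)):
        a = toy_encrypt(4 * n, groups[n])  # "A" registers: current group
        state = toy_hash(state, b)         # "B" registers: previous group
        out.extend(b)
        b = a                              # current ciphertext becomes "previous"
    state = toy_hash(state, b)             # epilogue: hash the last group
    out.extend(b)
    return out, state

groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(pipelined(groups) == reference(groups))
```

Inside the loop body, the encrypt step and the hash step touch different data and have no dependency on each other, which is what would let the two instruction streams be interleaved in assembly.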


> 
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.
4x unrolling may not produce the best performance.  I did that when I
implemented this stitching function in OpenSSL, where everything is in one
assembly file with no function calls outside the function.  Once again,
calling a function within a loop introduces a lot of overhead.  Here are my
past results for your reference.  The first one is the original performance
from OpenSSL, the second one is with 4x unrolling, and the third one is with
8x.  But I can try again.

(This was run on a 3.5 GHz p10 machine.)

        AES-128-GCM     382128.50k  1023073.64k  2621489.41k  3604979.37k  4018642.94k  4032080.55k  (original)
        AES-128-GCM     347370.13k  1236054.06k  2778748.59k  3900567.21k  4527158.61k  4579759.45k  (4x AES and 4x ghash)
        AES-128-GCM     356520.19k   989983.06k  2902907.56k  4379016.19k  5180981.25k  5249717.59k  (8x AES and 2 4x ghash combined)
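
To make the comparison easier to read, the figures above can be turned into speedup ratios relative to the original OpenSSL row. The following is only a small arithmetic sketch in Python; the dictionary keys and the `speedups` helper are names invented for this example, and the columns are simply taken in the order printed above.

```python
# Throughput figures copied from the three result rows above (units as printed, "k").
rows = {
    "original": [382128.50, 1023073.64, 2621489.41, 3604979.37, 4018642.94, 4032080.55],
    "4x":       [347370.13, 1236054.06, 2778748.59, 3900567.21, 4527158.61, 4579759.45],
    "8x":       [356520.19,  989983.06, 2902907.56, 4379016.19, 5180981.25, 5249717.59],
}

def speedups(variant, baseline="original"):
    """Per-column ratio of a variant's throughput to the baseline's."""
    return [v / b for v, b in zip(rows[variant], rows[baseline])]

for name in ("4x", "8x"):
    print(name, " ".join(f"{s:.2f}" for s in speedups(name)))
```

With these numbers, the last column works out to roughly 1.14x for the 4x variant and roughly 1.30x for the 8x variant, while in the first column both stitched variants are slower than the original.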

Thanks.
-Danny

> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
