Re: ppc64: AES/GCM Performance improvement with stitched implementation

David Edelsohn Wed, 22 Nov 2023 08:30:52 -0800

On Wed, Nov 22, 2023 at 10:37 AM Danny Tsen <dt...@us.ibm.com> wrote:


>
>
> > On Nov 22, 2023, at 2:27 AM, Niels Möller <ni...@lysator.liu.se> wrote:
> >
> > Danny Tsen <dt...@us.ibm.com> writes:
> >
> >> Interleaving at the instructions level may be a good option but due to
> >> PPC instruction pipeline this may need to have sufficient
> >> registers/vectors. Use same vectors to change contents in successive
> >> instructions may require more cycles. In that case, more
> >> vectors/scalar will get involved and all vectors assignment may have
> >> to change. That’s the reason I avoided in this case.
> >
> > To investigate the potential, I would suggest some experiments with
> > software pipelining.
> >
> > Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> > round loop. I think that should be 44 instructions of aes mangling, plus
> > instructions to setup the counter input, and do the final xor and
> > endianness things with the message. Arrange so that it loads the AES
> > state in a set of registers we can call A, operating in-place on these
> > registers. But at the end, arrange the XORing so that the final
> > cryptotext is located in a different set of registers, B.
> >
> > Then, write the instructions to do ghash using the B registers as input,
> > I think that should be about 20-25 instructions. Interleave those as
> > well as possible with the AES instructions (say, two aes instructions,
> > one ghash instruction, etc).
> >
> > Software pipelining means that each iteration of the loop does aes-ctr
> > on four blocks, + ghash on the output for the four *previous* blocks (so
> > one needs extra code outside of the loop to deal with first and last 4
> > blocks). Decrypt processing should be simpler.
> >
> > Then you can benchmark that loop in isolation. It doesn't need to be the
> > complete function, the handling of first and last blocks can be omitted,
> > and it doesn't even have to be completely correct, as long as it's the
> > right instruction mix and the right data dependencies. The benchmark
> > should give a good idea for the potential speedup, if any, from
> > instruction-level interleaving.
> This is a very ideal condition.  Too much interleaving may not produce the
> best results and different architectures may have different results.  I had
> tried various way when I implemented AES/GCM stitching functions for
> OpenSSL.  I’ll give it a try since your ghash function is different.
>
> >
> > I would hope 4-way is doable with available vector registers (and this
> > inner loop should be less than 100 instructions, so not too
> > unmanageable). Going up to 8-way (like the current AES code) would also
> > be interesting, but as you say, you might have a shortage of registers.
> > If you have to copy state between registers and memory in each iteration
> > of an 8-way loop (which it looks like you also have to do in your
> > current patch), that overhead cost may outweight the gains you have from
> > more independence in the AES rounds.
> 4x unrolling may not produce the best performance.  I did that when I
> implemented this stitching function in OpenSSL and it’s in one assembly
> file and no functions calls outside the function.  Once again, calling a
> function within a loop introduce a lot of overhead.  Here are my past
> results for your reference.  First one is the original performance from
> OpenSSL.  The second one was the 4x unrolling and the third one was the
> 8x.  But I can try again.
>
> (This was run on a p10 with 3.5 GHz machine)
>
>         AES-128-GCM     382128.50k  1023073.64k  2621489.41k  3604979.37k
> 4018642.94k  4032080.55k
>         AES-128-GCM     347370.13k  1236054.06k  2778748.59k  3900567.21k
> 4527158.61k  4579759.45k ( 4x AES and 4x ghash
> )
>         AES-128-GCM     356520.19k   989983.06k  2902907.56k  4379016.19k
> 5180981.25k  5249717.59k ( 8x AES and 2 4x gha
> sh combined)
>

Calls impose a lot of overhead on Power.

And both the efficient loop instruction and the preferred indirect call
instruction use the CTR register.

Thanks, David


>
> Thanks.
> -Danny
>
> >
> > Regards,
> > /Niels
> >
> > --
> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> > Internet email is subject to wholesale government surveillance.
>
> _______________________________________________
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
>
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

Re: ppc64: AES/GCM Performance improvement with stitched implementation

Reply via email to