Re: [PATCH 0/4] Initial POWER8 support

Andy Polyakov Thu, 28 Nov 2013 00:14:37 -0800

Any comments on that?

In one word "no-o-o-o-o-o-o". :-) In more words: the preferred way to integrate processor-specific code is the one laid out in the Intel AES-NI and SPARC T4 modules. And "preferred" does not really mean "matter of choice". [The s390x module is usually mentioned in this context, and the answer is I wish I had time to do something about it.]

This patch series adds the initial support for POWER8 new cryptographic
instructions.

Different versions of the ppc_vcipher_AES_[en|de]crypt were tested and
no significant performance gains were found, even using multiple vector
registers to load all sub-keys in advance.

You naturally won't observe a difference in a single-block function. Because all instructions are high latency and are dependent on each other, there are a lot of "free slots" to execute all the collateral instructions. While it's not self-obvious that the gain from pre-loading the key schedule can be observed in a single-threaded benchmark, even in code with interleaved instructions in parallelizable modes, there might be other factors to consider. The POWER8 processor is SMT (right?), and it should be advantageous to pre-load for stream operations, so that there is more memory bus bandwidth available to the other threads. Or it might be more appropriate to use the "free slots" [which will be less numerous in parallelizable modes] for other things, for example maintaining counter values in CTR...
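To make the "maintaining counter values in CTR" remark concrete, here is a minimal C sketch of the kind of collateral work meant: bumping a 128-bit big-endian counter block between cipher invocations. The helper name ctr128_inc_be is hypothetical and not taken from the patch; an assembly module would do the equivalent with a few integer or vector instructions scheduled into the idle slots.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Increment a 128-bit big-endian counter block in place.
 * Hypothetical helper for illustration only; not part of the patch series.
 */
static void ctr128_inc_be(uint8_t counter[16])
{
    size_t i = 16;

    /* Walk from the least significant byte (index 15) upwards. */
    while (i-- > 0) {
        if (++counter[i] != 0)
            break;              /* no carry out of this byte, done */
    }
}
```

The work is a handful of independent integer operations per block, which is why it is a natural candidate for the slots left empty by the dependent cipher rounds.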

Because of that, the version
included in this series was chosen based on readability.


Why not a folded loop then?

The performance
gain is about 5x on non-final hardware.

A more important question is what the theoretical asymptotic limit is, how far we are from it, and how to get there. Well, the answer is naturally mode-specific subroutines, but that doesn't change the point. One should also discuss absolute numbers, not only relative improvement.
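As a rough way to reason about that limit, here is a back-of-envelope C model. It is only a sketch: the function name and the latency and issue-interval parameters are placeholders, not measured POWER8 figures. It treats each block as a chain of dependent round instructions and assumes that interleaving independent blocks hides latency until the issue rate becomes the bottleneck.

```c
/*
 * Estimated cycles per byte for AES with `interleave` independent
 * blocks in flight. All timing parameters are assumptions supplied
 * by the caller; nothing here is a measured POWER8 number.
 *
 *   rounds         - cipher rounds per block (10 for AES-128)
 *   latency        - cycles until a round result can feed the next round
 *   issue_interval - cycles between issuing two independent rounds
 *   interleave     - number of blocks processed in parallel
 */
static double aes_cycles_per_byte(int rounds, double latency,
                                  double issue_interval, int interleave)
{
    /* Latency bound: every block is a chain of dependent rounds. */
    double chain = rounds * latency;
    /* Throughput bound: all rounds of all blocks must be issued. */
    double issue = rounds * issue_interval * interleave;
    double cycles = chain > issue ? chain : issue;

    return cycles / (16.0 * interleave);   /* 16 bytes per AES block */
}
```

With interleave set to 1 the model gives rounds * latency / 16, the single-block case; as interleave grows it flattens at rounds * issue_interval / 16, which is the asymptotic limit in this simplified picture. Comparing a measured cycles-per-byte figure against both ends shows how far a given implementation is from the limit.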

The patch "perlasm/ppc-xlate.pl: vcipher instructions support" is not
necessary for newer versions of GCC and I'd like to hear opinions on
whether it's worth including or not.

Absolutely. And it applies to all new instructions. One can choose to implement module-specific instructions in the module itself and common ones in ppc-xlate, e.g. vcipher in the AES module and ldxvd2x in ppc-xlate.

Feel free to ask me any questions regarding the code.

Doesn't one need to take care of vrsave? If it's not required on Linux, is it required elsewhere? [It was required on MacOS X].

Is the presented code endian-neutral? The manual doesn't discuss endianness in the vcipher context, so I assume that the instruction's operation does not depend on current endianness. Which would require split endian operation for loading data, I assume in little-endian mode.

As for ld/stxvd2x for data. The manual "threatens" with penalties on cache line and page boundaries, and it doesn't seem to actually make a promise that it always works with byte alignment across page boundaries. Yes, the OS surely handles it by serving the exception, but we don't want it to happen. Wouldn't it be more appropriate to adhere to l/stvx? [See the just committed vpaes-ppc.pl module for example.]

As for page boundaries in ld/stxvd2x. The key schedule is aligned at 64 bits (in e_aes.c) and this doesn't preclude the possibility of a ld/stxvd2x crossing a page boundary. And if there is a penalty, it might get costly [because of the recurring nature of references to the key schedule]. Should one consider lvx even for the key schedule?
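To illustrate the alignment concern, here is a small C sketch that checks whether an access of a given length straddles a power-of-two boundary. The helper name is hypothetical, and the 4096-byte page size used in the example that follows is an assumption (POWER systems are often configured with larger pages); the same check works for cache lines.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the `len`-byte access starting at `addr` touches two
 * different `boundary`-sized regions, 0 otherwise.
 * `boundary` must be a power of two and `len` at least 1.
 * Hypothetical helper for illustration only.
 */
static int crosses_boundary(uintptr_t addr, size_t len, size_t boundary)
{
    uintptr_t mask  = ~(uintptr_t)(boundary - 1);
    uintptr_t first = addr & mask;              /* region of first byte */
    uintptr_t last  = (addr + len - 1) & mask;  /* region of last byte  */

    return first != last;
}
```

For a pointer that is only 8-byte aligned, a 16-byte access starting 8 bytes before the end of a page gives crosses_boundary(4096 - 8, 16, 4096) == 1, while any 16-byte-aligned address can never straddle a page, which is the guarantee lvx/stvx provide by ignoring the low address bits.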


______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
