Any comments on that?

In one word: "no-o-o-o-o-o-o". :-) In more words: the preferred way to integrate processor-specific code is laid out in the Intel AES-NI and SPARC T4 modules. And "preferred" does not really mean "a matter of choice". [The s390x module is usually mentioned in this context, and the answer is that I wish I had time to do something about it.]

This patch series adds initial support for the new POWER8 cryptographic
instructions.

Different versions of the ppc_vcipher_AES_[en|de]crypt were tested and
no significant performance gains were found, even using multiple vector
registers to load all sub-keys in advance.

You naturally won't observe a difference in a single-block function, because all the instructions are high-latency and depend on each other, so there are a lot of "free slots" in which to execute all the collateral instructions. While it's not self-evident that a gain from pre-loading the key schedule can be observed in a single-threaded benchmark, even in code with interleaved instructions in parallelizable modes, there might be other factors to consider. The POWER8 processor is SMT (right?), and it should be advantageous to pre-load for stream operations, so that more memory bus bandwidth is available to the other threads. Or it might be more appropriate to use the "free slots" [which will be less numerous in parallelizable modes] for other things, for example maintaining counter values in CTR...
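To make the CTR remark concrete: the counter update is the kind of collateral work that could occupy otherwise idle slots. A minimal sketch in plain C, assuming the whole 16-byte block is treated as one big-endian counter (the function name is hypothetical and this is not taken from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Increment a 128-bit big-endian counter block by one.
 * The carry propagates from the last byte towards the first,
 * and the loop stops as soon as there is nothing left to carry. */
void ctr128_inc_be(unsigned char counter[16])
{
    unsigned int carry = 1;
    size_t n = 16;

    assert(counter != NULL);
    while (n > 0 && carry != 0) {
        --n;
        carry += counter[n];
        counter[n] = (unsigned char)(carry & 0xff);
        carry >>= 8;
    }
}
```

In a vectorized CTR routine the same update would presumably be done in registers, several counters at a time, between the cipher instructions.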

Because of that, the version
included in this series was chosen based on readability.

Why not a folded loop then?
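By "folded" I mean one round body executed in a loop, as opposed to the rounds written out one after another. A toy illustration in C; toy_round is a hypothetical stand-in and NOT the AES round, the point is only the shape of the two variants:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one cipher round (not AES):
 * mix in the round key, then rotate left by 7 bits. */
static uint64_t toy_round(uint64_t state, uint64_t rk)
{
    state ^= rk;
    return (state << 7) | (state >> 57);
}

/* Unrolled: ten rounds written out explicitly. */
uint64_t toy_encrypt_unrolled(uint64_t s, const uint64_t rk[11])
{
    s ^= rk[0];
    s = toy_round(s, rk[1]);
    s = toy_round(s, rk[2]);
    s = toy_round(s, rk[3]);
    s = toy_round(s, rk[4]);
    s = toy_round(s, rk[5]);
    s = toy_round(s, rk[6]);
    s = toy_round(s, rk[7]);
    s = toy_round(s, rk[8]);
    s = toy_round(s, rk[9]);
    s = toy_round(s, rk[10]);
    return s;
}

/* Folded: one round body, the round count is a parameter. */
uint64_t toy_encrypt_folded(uint64_t s, const uint64_t *rk, size_t rounds)
{
    assert(rk != NULL && rounds >= 1);
    s ^= rk[0];
    for (size_t i = 1; i <= rounds; i++)
        s = toy_round(s, rk[i]);
    return s;
}
```

The folded form is shorter to read, and one body covers 10, 12 and 14 rounds. If the unrolled form brings no measurable gain, readability argues for the loop.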

The performance
gain is about 5x on non-final hardware.

A more important question is what the theoretical asymptotic limit is, how far we are from it, and how to get there. Well, the answer is naturally mode-specific subroutines, but that doesn't change the point. One should discuss absolute numbers too, not only relative improvement.
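A back-of-the-envelope way to put numbers on that limit, as a sketch. The latency and issue-rate arguments are placeholders to be filled in from the processor manual or from measurement; I'm not claiming any particular values for POWER8, and the function names are made up for illustration:

```c
#include <assert.h>

/* Single block, fully dependent chain: every round waits for the
 * previous one, so the cost is rounds * latency per block. */
double cpb_latency_bound(unsigned rounds, double latency_cycles,
                         unsigned block_bytes)
{
    assert(rounds > 0 && latency_cycles > 0.0 && block_bytes > 0);
    return rounds * latency_cycles / block_bytes;
}

/* Many independent blocks interleaved: the limit is how many cipher
 * instructions can be issued per cycle. */
double cpb_throughput_bound(unsigned rounds, double issue_per_cycle,
                            unsigned block_bytes)
{
    assert(rounds > 0 && issue_per_cycle > 0.0 && block_bytes > 0);
    return rounds / (issue_per_cycle * block_bytes);
}

/* Convert cycles per byte to megabytes per second at a given clock. */
double mbytes_per_second(double cpb, double clock_hz)
{
    assert(cpb > 0.0 && clock_hz > 0.0);
    return clock_hz / cpb / 1e6;
}
```

For example, 10 rounds on 16-byte blocks with a purely illustrative latency of 6 cycles gives 3.75 cycles per byte for the dependent chain, while one instruction issued per cycle would give 0.625. The gap between the two is what mode-specific, interleaved subroutines are after.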

The patch "perlasm/ppc-xlate.pl: vcipher instructions support" is not
necessary for newer versions of GCC and I'd like to hear opinions on
whether it's worth including or not.

Absolutely. And it applies to all new instructions. One can choose to implement module-specific instructions in the module itself and common ones in ppc-xlate, e.g. vcipher in the AES module and lxvd2x in ppc-xlate.

Feel free to ask me any questions regarding the code.

Doesn't one need to take care of vrsave? If it's not required on Linux, is it required elsewhere? [It was required on MacOS X].

Is the presented code endian-neutral? The manual doesn't discuss endianness in the vcipher context, so I assume that the instruction's operation does not depend on the current endianness. That would require a separate, endian-specific path for loading data, I assume in little-endian mode.
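The same concern in plain C terms, as an analogy only (this is not the VSX code, and all names are hypothetical): a byte-wise load gives the same value on any host, while a native load needs a fix-up on little-endian hosts to match it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Endian-neutral: assemble the value byte by byte, first byte is
 * the most significant. */
uint64_t load_be64_bytewise(const unsigned char *p)
{
    uint64_t v = 0;

    assert(p != NULL);
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap64(uint64_t v)
{
    uint64_t r = 0;

    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Native load plus a fix-up that is only applied on little-endian
 * hosts: the "separate path" mentioned above. */
uint64_t load_be64_native_fixup(const unsigned char *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    if (host_is_little_endian())
        v = swap64(v);
    return v;
}
```

In the vector code the fix-up would presumably be a permute after the load, and the question is whether the presented code needs one in little-endian mode.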

As for l/stxvd2x for data: the manual "threatens" penalties at cache-line and page boundaries, and it doesn't seem to actually promise that it always works with byte alignment across page boundaries. Yes, the OS surely handles it by servicing the exception, but we don't want that to happen. Wouldn't it be more appropriate to stick to l/stvx? [See the just-committed vpaes-ppc.pl module for an example.]
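For reference, lvx ignores the low four bits of the address, so the usual way to read unaligned data with it is two aligned loads plus a permute. A plain C model of that idea, not actual AltiVec code, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 16 bytes from an arbitrarily aligned address p using only
 * 16-byte-aligned 16-byte reads. Note that the aligned reads may
 * touch bytes just before p and just after p+15, within the same
 * aligned chunks. */
void load16_via_aligned(const unsigned char *p, unsigned char out[16])
{
    uintptr_t a = (uintptr_t)p;
    const unsigned char *lo, *hi;
    unsigned char tmp[32];
    size_t shift = (size_t)(a & 15);

    assert(p != NULL && out != NULL);
    /* Aligned chunk holding the first byte, and the one holding the
     * last byte; they coincide when p is already aligned. */
    lo = (const unsigned char *)(a & ~(uintptr_t)15);
    hi = (const unsigned char *)((a + 15) & ~(uintptr_t)15);

    memcpy(tmp, lo, 16);
    memcpy(tmp + 16, hi, 16);
    /* Stand-in for the permute: pick the 16 bytes starting at shift. */
    memcpy(out, tmp + shift, 16);
}
```

Since an aligned 16-byte chunk never straddles a cache line or a page, neither access can hit the boundary cases the manual warns about.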

As for page boundaries in l/stxvd2x: the key schedule is aligned at 64 bits (in e_aes.c), and this doesn't preclude the possibility of an l/stxvd2x crossing a page boundary. And if there is a penalty, it might get costly [because of the recurring nature of references to the key schedule]. Should one consider lvx even for the key schedule?
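A quick way to see the alignment argument, as a sketch in C. The boundary sizes are parameters; 4096 and 128 below are illustrative only, and the actual page and cache-line sizes on a given POWER8 system may differ. Function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of len bytes starting at addr touches two
 * different boundary-sized regions. boundary must be a power of two. */
int crosses_boundary(uintptr_t addr, size_t len, size_t boundary)
{
    uintptr_t mask;

    assert(len > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    mask = ~((uintptr_t)boundary - 1);
    return (addr & mask) != ((addr + len - 1) & mask);
}

/* Counts how many start offsets within one region, taken in steps
 * of align, make a len-byte access cross into the next region. */
size_t count_crossings(size_t align, size_t len, size_t boundary)
{
    size_t crossings = 0;

    assert(align > 0);
    for (size_t off = 0; off < boundary; off += align)
        if (crosses_boundary((uintptr_t)off, len, boundary))
            crossings++;
    return crossings;
}
```

With 8-byte alignment, exactly one start offset per region (the last one, e.g. 4088 in a 4096-byte page) makes a 16-byte access cross the boundary; with 16-byte alignment there is none. So the crossing is rare per address, but if the key schedule happens to sit there, every pass over it pays.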

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org