Use the host native instructions to implement the AES instructions exposed by the emulated target. The mapping is not 1:1, so it requires a bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but given
the non-1:1 correspondence, doing a proper abstraction is rather difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this helps
  illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's tcrypt
benchmark (mode=500).

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to the
fact that ARM uses two instructions to implement a single AES round, whereas
x86 only uses one.

Note that using the ARM intrinsics is fiddly with Clang, as it does not declare
the prototypes unless some builtin CPP macro (__ARM_FEATURE_AES) is defined,
which will be set by the compiler based on the command line arch/cpu options.
However, setting this globally for a compilation unit is dubious, given that we
test cpuinfo for AES support, and only emit the instructions conditionally. So
I used inline asm() instead.

As for the design of an abstraction: I imagine we could introduce a host/aes.h
API that implements some building blocks that the TCG helper implementation
could use.

Quoting from my reply to Richard:

Using the primitive operations defined in the AES paper, we basically perform
the following transformation for n rounds of AES (for n in {10, 12, 14})

  for (n-1 rounds) {
    AddRoundKey
    ShiftRows
    SubBytes
    MixColumns
  }
  AddRoundKey
  ShiftRows
  SubBytes
  AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions that
combine a couple of these steps.
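
As a rough illustration of the primitive steps named above, here is a minimal
portable C model of AddRoundKey, ShiftRows and SubBytes. It is a sketch for
reasoning about the ordering only, not code from this series: the type
aes_state, all function names and the AES_MODEL_DEMO macro are hypothetical,
MixColumns is left out for brevity, and the state is assumed to be stored
column-major as in FIPS-197. The self-check exercises two properties that
matter when steps are fused: ShiftRows and SubBytes commute, and an all-zero
round key turns AddRoundKey into a no-op whether the XOR comes first or last.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model type: 16 state bytes, column-major (index = row + 4*col). */
typedef struct { uint8_t b[16]; } aes_state;

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Build the AES S-box: GF(2^8) inverse followed by the affine transform. */
void aes_sbox_init(uint8_t sbox[256])
{
    uint8_t p = 1, q = 1;

    do {
        /* p *= 3 and q /= 3 in GF(2^8), which keeps p * q == 1. */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0));
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* 0 has no inverse, so it is handled separately */
}

/* AddRoundKey: plain XOR of state and round key. */
aes_state add_round_key(aes_state s, const aes_state *key)
{
    for (int i = 0; i < 16; i++)
        s.b[i] ^= key->b[i];
    return s;
}

/* SubBytes: substitute every byte through the S-box. */
aes_state sub_bytes(aes_state s, const uint8_t sbox[256])
{
    for (int i = 0; i < 16; i++)
        s.b[i] = sbox[s.b[i]];
    return s;
}

/* ShiftRows: rotate row r left by r columns. */
aes_state shift_rows(aes_state s)
{
    aes_state out;

    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            out.b[r + 4 * c] = s.b[r + 4 * ((c + r) % 4)];
    return out;
}

/* Returns 1 if all orderings agree for a sample state, 0 otherwise. */
int aes_model_selfcheck(void)
{
    uint8_t sbox[256];
    aes_state s, zero = {{0}}, a, b, pre, post;

    aes_sbox_init(sbox);
    for (int i = 0; i < 16; i++)
        s.b[i] = (uint8_t)(i * 17 + 3);

    /* A byte-wise substitution and a byte permutation commute. */
    a = sub_bytes(shift_rows(s), sbox);
    b = shift_rows(sub_bytes(s, sbox));

    /* Key XORed in first vs. last; with a zero key both reduce to 'a'. */
    pre  = shift_rows(sub_bytes(add_round_key(s, &zero), sbox));
    post = add_round_key(sub_bytes(shift_rows(s), sbox), &zero);

    return memcmp(a.b, b.b, 16) == 0 &&
           memcmp(a.b, pre.b, 16) == 0 &&
           memcmp(a.b, post.b, 16) == 0;
}

#ifdef AES_MODEL_DEMO /* hypothetical macro: build with -DAES_MODEL_DEMO to run */
int main(void)
{
    assert(aes_model_selfcheck());
    return 0;
}
#endif
```

The S-box is computed rather than tabulated only to keep the sketch short; a
real helper would use a lookup table or the host instructions.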

So on x86, we have

  aesenc:
    ShiftRows
    SubBytes
    MixColumns
    AddRoundKey

  aesenclast:
    ShiftRows
    SubBytes
    AddRoundKey

and on ARM we have

  aese:
    AddRoundKey
    ShiftRows
    SubBytes

  aesmc:
    MixColumns

So a generic routine that does only ShiftRows+SubBytes could be backed by
x86's aesenclast and ARM's aese, using a NULL round key argument in each case.
Then, it would be up to the TCG helper code for either ARM or x86 to
incorporate those routines in the right way.

I suppose it really depends on whether there is a third host architecture that
could make use of this, and how its AES instructions map onto the primitive AES
ops above.

Cc: Peter Maydell <peter.mayd...@linaro.org>
Cc: Alex Bennée <alex.ben...@linaro.org>
Cc: Richard Henderson <richard.hender...@linaro.org>
Cc: Philippe Mathieu-Daudé <f4...@amsat.org>

Ard Biesheuvel (2):
  target/arm: use x86 intrinsics to implement AES instructions
  target/i386: Implement AES instructions using AArch64 counterparts

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h    |  1 +
 target/arm/tcg/crypto_helper.c      | 37 ++++++++++-
 target/i386/ops_sse.h               | 69 ++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 util/cpuinfo-i386.c                 |  1 +
 6 files changed, 107 insertions(+), 3 deletions(-)

-- 
2.39.2