On 5/31/23 04:22, Ard Biesheuvel wrote:
Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.
This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.
Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly
Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500).
Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.
Thanks. I spent some time yesterday looking at this with an encrypted disk test case, and
could only measure 0.6% and 0.5% total overhead for decrypt and encrypt respectively.
As for the design of an abstraction: I imagine we could introduce a
host/aes.h API that implements some building blocks that the TCG helper
implementation could use.
Indeed. I was considering interfaces like
/* Perform SubBytes + ShiftRows on state. */
Int128 aesenc_SB_SR(Int128 state);
/* Perform MixColumns on state. */
Int128 aesenc_MC(Int128 state);
/* Perform SubBytes + ShiftRows + MixColumns on state. */
Int128 aesenc_SB_SR_MC(Int128 state);
/* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
and so forth for aesdec as well. All but aesenc_MC should be implementable on x86 and
Power7, and all of them on aarch64.
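To make the proposed split concrete, here is a rough portable C sketch of what a generic fallback for these four building blocks might look like. It is only an illustration under stated assumptions: AESStateSketch is a hypothetical byte-array stand-in for Int128 (bytes in FIPS-197 order, so host endianness does not enter), the S-box is computed from its definition rather than taken from the AES_Te* tables, and aes_sketch_selftest is an invented helper that checks the result against the round 1 values of FIPS-197 Appendix B.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Int128: byte 4*c + r is row r of column c. */
typedef struct { uint8_t b[16]; } AESStateSketch;

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) {
            p ^= a;
        }
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) {
            a ^= 0x1b;
        }
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* S-box from its definition: field inverse, then the affine transform. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 0;   /* 0 has no inverse and maps to 0 */
    for (int y = 1; y < 256; y++) {
        if (gf_mul(x, (uint8_t)y) == 1) {
            inv = (uint8_t)y;
            break;
        }
    }
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3)
               ^ rotl8(inv, 4) ^ 0x63;
}

/* SubBytes + ShiftRows: row r of output column c comes from column c + r. */
AESStateSketch aesenc_SB_SR(AESStateSketch s)
{
    AESStateSketch o;
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            o.b[4 * c + r] = sbox(s.b[4 * ((c + r) % 4) + r]);
        }
    }
    return o;
}

/* MixColumns: each column is multiplied by the circulant {02, 03, 01, 01}. */
AESStateSketch aesenc_MC(AESStateSketch s)
{
    AESStateSketch o;
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &s.b[4 * c];
        for (int r = 0; r < 4; r++) {
            o.b[4 * c + r] = gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                           ^ a[(r + 2) % 4] ^ a[(r + 3) % 4];
        }
    }
    return o;
}

AESStateSketch aesenc_SB_SR_MC(AESStateSketch s)
{
    return aesenc_MC(aesenc_SB_SR(s));
}

AESStateSketch aesenc_SB_SR_MC_AK(AESStateSketch s, AESStateSketch rk)
{
    AESStateSketch o = aesenc_SB_SR_MC(s);
    for (int i = 0; i < 16; i++) {
        o.b[i] ^= rk.b[i];
    }
    return o;
}

/* Returns 1 if round 1 of the FIPS-197 Appendix B example is reproduced. */
int aes_sketch_selftest(void)
{
    static const AESStateSketch in = {{
        0x19, 0x3d, 0xe3, 0xbe, 0xa0, 0xf4, 0xe2, 0x2b,
        0x9a, 0xc6, 0x8d, 0x2a, 0xe9, 0xf8, 0x48, 0x08 }};
    static const AESStateSketch rk = {{
        0xa0, 0xfa, 0xfe, 0x17, 0x88, 0x54, 0x2c, 0xb1,
        0x23, 0xa3, 0x39, 0x39, 0x2a, 0x6c, 0x76, 0x05 }};
    static const AESStateSketch want = {{
        0xa4, 0x9c, 0x7f, 0xf2, 0x68, 0x9f, 0x35, 0x2b,
        0x6b, 0x5b, 0xea, 0x43, 0x02, 0x6a, 0x50, 0x49 }};
    AESStateSketch out = aesenc_SB_SR_MC_AK(in, rk);
    return memcmp(out.b, want.b, sizeof(out.b)) == 0;
}
```

A host-accelerated version would presumably keep the same four entry points and swap in native instructions where the host has a matching primitive, with something like the above as the generic fallback.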
I suppose it really depends on whether there is a third host
architecture that could make use of this, and how its AES instructions
map onto the primitive AES ops above.
There are Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im}).
Where I got hung up yesterday was understanding the different endian requirements of
x86 vs Power.
ppc64:
asm("lxvd2x 32,0,%1;"
"lxvd2x 33,0,%2;"
"vcipher 0,0,1;"
"stxvd2x 32,0,%0"
: : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
ppc64le:
unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
asm("lxvd2x 32,0,%1;"
"lxvd2x 33,0,%2;"
"lxvd2x 34,0,%3;"
"vperm 0,0,0,2;"
"vperm 1,1,1,2;"
"vcipher 0,0,1;"
"vperm 0,0,0,2;"
"stxvd2x 32,0,%0"
: : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
There are also differences in their AES_Te* based C routines, which made me wonder
if we are handling host endianness differences correctly in emulation right now. I think
I should most definitely add some generic-ish tests for this...
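As a rough idea of the kind of generic check meant here, the sketch below contrasts an explicit little-endian load of the state bytes with a plain host-order load. All names (host_is_little_endian, load_le64, load_host64, endian_sketch_selftest) are hypothetical and not existing QEMU helpers. The point is only that a state defined as bytes gives the same 64-bit halves on every host when loaded explicitly, while a host-order access matches only on little-endian hosts.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Load 8 bytes as a little-endian 64-bit value, whatever the host is. */
uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Load 8 bytes the way a plain host-order memory access would. */
uint64_t load_host64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Returns 1 if the explicit load gives the expected fixed values and the
 * host-order load agrees with it exactly when the host is little-endian.
 */
int endian_sketch_selftest(void)
{
    static const uint8_t state[16] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
    uint64_t lo = load_le64(state);
    uint64_t hi = load_le64(state + 8);

    if (lo != 0x0706050403020100ULL || hi != 0x0f0e0d0c0b0a0908ULL) {
        return 0;
    }
    return (load_host64(state) == lo) == host_is_little_endian();
}
```

A real test would feed the same byte-defined state and round key through each helper on both big- and little-endian hosts and compare against a known-answer vector.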
r~