On Wed, 31 May 2023 at 18:33, Richard Henderson <richard.hender...@linaro.org> wrote:
>
> On 5/31/23 04:22, Ard Biesheuvel wrote:
> > Use the host native instructions to implement the AES instructions
> > exposed by the emulated target. The mapping is not 1:1, so it requires a
> > bit of fiddling to get the right result.
> >
> > This is still RFC material - the current approach feels too ad-hoc, but
> > given the non-1:1 correspondence, doing a proper abstraction is rather
> > difficult.
> >
> > Changes since v1/RFC:
> > - add second patch to implement x86 AES instructions on ARM hosts - this
> >   helps illustrate what an abstraction should cover.
> > - use cpuinfo framework to detect host support for AES instructions.
> > - implement ARM aesimc using x86 aesimc directly
> >
> > Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> > tcrypt benchmark (mode=500)
> >
> > Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> > the fact that ARM uses two instructions to implement a single AES round,
> > whereas x86 only uses one.
>
> Thanks. I spent some time yesterday looking at this, with an encrypted
> disk test case and could only measure 0.6% and 0.5% for total overhead
> of decrypt and encrypt respectively.
>

I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

> > As for the design of an abstraction: I imagine we could introduce a
> > host/aes.h API that implements some building blocks that the TCG helper
> > implementation could use.
>
> Indeed. I was considering interfaces like
>
> /* Perform SubBytes + ShiftRows on state. */
> Int128 aesenc_SB_SR(Int128 state);
>
> /* Perform MixColumns on state. */
> Int128 aesenc_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns on state. */
> Int128 aesenc_SB_SR_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
> Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
>
> and so forth for aesdec as well. All but aesenc_MC should be
> implementable on x86 and Power7, and all of them on aarch64.
>

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc.

> > I suppose it really depends on whether there is a third host
> > architecture that could make use of this, and how its AES instructions
> > map onto the primitive AES ops above.
>
> There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im})
>
> I got hung up yesterday was understanding the different endian
> requirements of x86 vs Power.
>
> ppc64:
>
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "vcipher 0,0,1;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k), : "memory", "v0", "v1", "v2");
>
> ppc64le:
>
>      unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "lxvd2x 34,0,%3;"
>          "vperm 0,0,0,2;"
>          "vperm 1,1,1,2;"
>          "vcipher 0,0,1;"
>          "vperm 0,0,0,2;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>
> There are also differences in their AES_Te* based C routines as well,
> which made me wonder if we are handling host endianness differences
> correctly in emulation right now.
> I think I should most definitely add some generic-ish tests for this...
>

The above kind of sums it up, no? Or isn't this working code?
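
To make the aesdeclast+aesenc remark above concrete, here is a minimal
portable C sketch of why that pair leaves only MixColumns. It is a
model under stated assumptions, not the code from the patch: AESState
is a hypothetical byte-array stand-in for Int128, model_aesenc() and
model_aesdeclast() are made-up helpers that mimic the x86 instruction
semantics in plain C, and aesdec_ISB_ISR() is an assumed name following
the aesenc_* naming quoted above. It works on bytes in FIPS-197 order
(byte i is row i % 4, column i / 4), so it does not depend on host
endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Int128: 16 state bytes in FIPS-197 order. */
typedef struct { uint8_t b[16]; } AESState;

static uint8_t sbox[256], inv_sbox[256];
static int tables_ready;

/* Multiply by {02} in GF(2^8), reducing by the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
}

/* General GF(2^8) multiply (shift-and-add). */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (; b; b >>= 1, a = xtime(a)) {
        if (b & 1) {
            r ^= a;
        }
    }
    return r;
}

static uint8_t rotl8(uint8_t v, int n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* Build the S-box from its definition: field inverse, then affine map. */
static void build_tables(void)
{
    if (tables_ready) {
        return;
    }
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0; /* 0 has no inverse and maps to 0 */
        for (int c = 1; c < 256; c++) {
            if (gmul((uint8_t)x, (uint8_t)c) == 1) {
                inv = (uint8_t)c;
                break;
            }
        }
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[x] = s;
        inv_sbox[s] = (uint8_t)x;
    }
    tables_ready = 1;
}

/* SubBytes + ShiftRows (bytewise substitution commutes with the permutation). */
static AESState aesenc_SB_SR(AESState s)
{
    AESState o;
    build_tables();
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            o.b[r + 4 * c] = sbox[s.b[r + 4 * ((c + r) % 4)]];
    return o;
}

/* InvSubBytes + InvShiftRows: the exact inverse of aesenc_SB_SR(). */
static AESState aesdec_ISB_ISR(AESState s)
{
    AESState o;
    build_tables();
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            o.b[r + 4 * ((c + r) % 4)] = inv_sbox[s.b[r + 4 * c]];
    return o;
}

/* MixColumns: each column is multiplied by the circulant {02,03,01,01}. */
static AESState aesenc_MC(AESState s)
{
    AESState o;
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &s.b[4 * c];
        for (int r = 0; r < 4; r++)
            o.b[4 * c + r] = (uint8_t)(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3) ^
                                       a[(r + 2) % 4] ^ a[(r + 3) % 4]);
    }
    return o;
}

static AESState xor_key(AESState s, AESState k)
{
    for (int i = 0; i < 16; i++)
        s.b[i] ^= k.b[i];
    return s;
}

/* Plain C models of the two x86 instructions (names are made up). */
static AESState model_aesenc(AESState s, AESState k)
{
    return xor_key(aesenc_MC(aesenc_SB_SR(s)), k);
}

static AESState model_aesdeclast(AESState s, AESState k)
{
    return xor_key(aesdec_ISB_ISR(s), k);
}

/* MixColumns alone, composed from aesdeclast + aesenc with zero round keys. */
static AESState aesenc_MC_via_x86(AESState s)
{
    AESState zero;
    memset(&zero, 0, sizeof(zero));
    return model_aesenc(model_aesdeclast(s, zero), zero);
}

/* Usage: aborts via assert() if the identity or a known column fails. */
static void selfcheck(void)
{
    AESState s = {{0xdb, 0x13, 0x53, 0x45, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}};
    AESState direct = aesenc_MC(s), via = aesenc_MC_via_x86(s);
    static const uint8_t col0[4] = {0x8e, 0x4d, 0xa1, 0xbc};

    assert(memcmp(direct.b, via.b, 16) == 0);
    assert(memcmp(direct.b, col0, 4) == 0);
}
```

With a zero round key, aesdeclast applies InvShiftRows + InvSubBytes,
which the SubBytes + ShiftRows half of the following aesenc cancels, so
only MixColumns remains. selfcheck() asserts that identity and the
MixColumns column db 13 53 45 -> 8e 4d a1 bc; a byte-wise model along
these lines might also serve as the reference for generic tests.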