Re: [RFC PATCH] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic
On Sun, Feb 05, 2017 at 12:10:53AM +0100, Jason A. Donenfeld wrote:
> Another thing that might be helpful is that you can let gcc decide on
> the alignment, and then optimize appropriately. Check out what we do
> with siphash:
>
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/linux/siphash.h#n76
>
> static inline u64 siphash(const void *data, size_t len, const siphash_key_t *key)
> {
> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>         if (!IS_ALIGNED((unsigned long)data, SIPHASH_ALIGNMENT))
>                 return __siphash_unaligned(data, len, key);
> #endif
>         return ___siphash_aligned(data, len, key);
> }
>
> With this trick, we fall through to the fast alignment-assuming code,
> if gcc can prove that the address is aligned. This is often the case
> when passing structs, or when passing buffers that have
> __aligned(BLOCKSIZE). It proves to be a very useful optimization on
> some platforms.

Yes, this is a good idea. It seems that usually at least one of the two
pointers passed to crypto_xor() will have alignment unknown to the
compiler, but the length is often a compile-time constant, which
inlining can take advantage of. For example, if someone does
crypto_xor(foo, bar, 16) on x86_64 or ARM64, we'd really like it to
turn into just a few instructions like this:

        mov    (%rsi),%rax
        xor    %rax,(%rdi)
        mov    0x8(%rsi),%rax
        xor    %rax,0x8(%rdi)

So how about inlining crypto_xor() if
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set or the pointers are
long-aligned, and otherwise calling an out-of-line function
__crypto_xor_unaligned() that handles all the cases with weird
alignment?  Something like the following patch:

(Note: exactly how __crypto_xor_unaligned() is implemented is still
debatable; it could be more similar to Ard's proposal, or it could use
the unaligned access helpers.)

diff --git a/crypto/algapi.c b/crypto/algapi.c
index df939b54b09f..a0591db3f13a 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -972,23 +972,69 @@ void crypto_inc(u8 *a, unsigned int size)
 }
 EXPORT_SYMBOL_GPL(crypto_inc);
 
-static inline void crypto_xor_byte(u8 *a, const u8 *b, unsigned int size)
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+void __crypto_xor_unaligned(u8 *dst, const u8 *src, unsigned int len)
 {
-	for (; size; size--)
-		*a++ ^= *b++;
-}
+	unsigned long delta = (unsigned long)dst ^ (unsigned long)src;
 
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
-{
-	u32 *a = (u32 *)dst;
-	u32 *b = (u32 *)src;
+	/* Handle relative misalignment */
+	if (delta % sizeof(unsigned long)) {
+
+		/* 1-byte relative misalignment? */
+		if (delta & 1) {
+			while (len--)
+				*dst++ ^= *src++;
+			return;
+		}
 
-	for (; size >= 4; size -= 4)
-		*a++ ^= *b++;
+		/* 2-byte relative misalignment? */
+		if ((delta & 2) || sizeof(unsigned long) == 4) {
+			if ((unsigned long)dst % __alignof__(u16) && len) {
+				*dst++ ^= *src++;
+				len--;
+			}
+			while (len >= 2) {
+				*(u16 *)dst ^= *(u16 *)src;
+				dst += 2, src += 2, len -= 2;
+			}
+			if (len)
+				*dst ^= *src;
+			return;
+		}
+
+		/* 4-byte relative misalignment? */
+		while ((unsigned long)dst % __alignof__(u32) && len) {
+			*dst++ ^= *src++;
+			len--;
+		}
+		while (len >= 4) {
+			*(u32 *)dst ^= *(u32 *)src;
+			dst += 4, src += 4, len -= 4;
+		}
+		while (len--)
+			*dst++ ^= *src++;
+		return;
+	}
+
+	/* No relative misalignment; use word accesses */
+
+	while ((unsigned long)dst % __alignof__(unsigned long) && len) {
+		*dst++ ^= *src++;
+		len--;
+	}
+
+	while (len >= sizeof(unsigned long)) {
+		*(unsigned long *)dst ^= *(unsigned long *)src;
+		dst += sizeof(unsigned long);
+		src += sizeof(unsigned long);
+		len -= sizeof(unsigned long);
+	}
 
-	crypto_xor_byte((u8 *)a, (u8 *)b, size);
+	while (len--)
+		*dst++ ^= *src++;
 }
-EXPORT_SYMBOL_GPL(crypto_xor);
+EXPORT_SYMBOL_GPL(__crypto_xor_unaligned);
+#endif /* !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
 
 unsigned int crypto_alg_extsize(struct crypto_alg *alg)
 {
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 404e9558e879..718145c5eaca 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -191,9
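(The algapi.h hunk is cut off above. As a rough sketch only of what the
inline wrapper there might look like -- the IS_ENABLED()/IS_ALIGNED()
form and the exact fallback condition are assumptions, since the actual
hunk is truncated:)

/*
 * Sketch, not the actual patch: take the fast inline path when
 * unaligned accesses are cheap or both pointers happen to be
 * long-aligned; otherwise call the out-of-line fallback.
 */
void __crypto_xor_unaligned(u8 *dst, const u8 *src, unsigned int len);

static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
{
	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
	    (IS_ALIGNED((unsigned long)dst, __alignof__(unsigned long)) &&
	     IS_ALIGNED((unsigned long)src, __alignof__(unsigned long)))) {
		/* With a constant len, gcc fully unrolls these loops */
		while (len >= sizeof(unsigned long)) {
			*(unsigned long *)dst ^= *(unsigned long *)src;
			dst += sizeof(unsigned long);
			src += sizeof(unsigned long);
			len -= sizeof(unsigned long);
		}
		while (len--)
			*dst++ ^= *src++;
	} else {
		__crypto_xor_unaligned(dst, src, len);
	}
}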
Re: [RFC PATCH] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic
Another thing that might be helpful is that you can let gcc decide on
the alignment, and then optimize appropriately. Check out what we do
with siphash:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/linux/siphash.h#n76

static inline u64 siphash(const void *data, size_t len, const siphash_key_t *key)
{
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	if (!IS_ALIGNED((unsigned long)data, SIPHASH_ALIGNMENT))
		return __siphash_unaligned(data, len, key);
#endif
	return ___siphash_aligned(data, len, key);
}

With this trick, we fall through to the fast alignment-assuming code,
if gcc can prove that the address is aligned. This is often the case
when passing structs, or when passing buffers that have
__aligned(BLOCKSIZE). It proves to be a very useful optimization on
some platforms.
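(As a concrete illustration of the trick -- the struct and function
names below are invented, only SIPHASH_ALIGNMENT, siphash_key_t, and
siphash() come from siphash.h: because the struct is declared
__aligned(SIPHASH_ALIGNMENT), gcc can prove the IS_ALIGNED() test at
compile time and emit a direct call to the aligned variant:)

#include <linux/siphash.h>
#include <linux/types.h>

/* Hypothetical caller-side example of the alignment trick. */
struct flow_tuple {
	__be32 saddr;
	__be32 daddr;
	__be16 sport;
	__be16 dport;
} __aligned(SIPHASH_ALIGNMENT);

static u64 flow_hash(const struct flow_tuple *t, const siphash_key_t *key)
{
	/* The unaligned fallback branch is compiled away entirely. */
	return siphash(t, sizeof(*t), key);
}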
Re: [RFC PATCH] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic
Hey,

On Thu, Feb 2, 2017 at 7:47 AM, Eric Biggers wrote:
> I'm wondering whether it has to be that way, especially since it seems to most
> commonly be used on very small input buffers, e.g. 8 or 16-byte blocks.

Note that popular stream ciphers like chacha or salsa wind up XORing
much longer blocks -- 64 bytes. Likewise, CTR mode tends to XOR using
the block size as well. Not sure whether this is directly relevant to
the decision making here, but I thought I'd mention it just in case.
The XOR for this case should be _fast_, and preferably inlineable.

Jason
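(For illustration, a chacha-style walk looks roughly like the sketch
below -- chacha_block() is a stand-in for whatever generates the next
keystream block, not a name taken from this thread:)

#include <crypto/algapi.h>
#include <linux/types.h>

#define CHACHA_BLOCK_SIZE 64

void chacha_block(u32 state[16], u8 out[CHACHA_BLOCK_SIZE]); /* assumed */

/*
 * Sketch: the keystream is XORed in one full 64-byte block at a time,
 * so crypto_xor() is called with a constant, block-sized length.
 */
static void chacha_xor_sketch(u32 state[16], u8 *data, unsigned int bytes)
{
	u8 keystream[CHACHA_BLOCK_SIZE];

	while (bytes >= CHACHA_BLOCK_SIZE) {
		chacha_block(state, keystream);
		crypto_xor(data, keystream, CHACHA_BLOCK_SIZE);
		data += CHACHA_BLOCK_SIZE;
		bytes -= CHACHA_BLOCK_SIZE;
	}
	if (bytes) {
		chacha_block(state, keystream);
		crypto_xor(data, keystream, bytes); /* short tail */
	}
}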
Re: [RFC PATCH] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic
On 2 February 2017 at 06:47, Eric Biggers wrote:
> On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
>> Instead of unconditionally forcing 4 byte alignment for all generic
>> chaining modes that rely on crypto_xor() or crypto_inc() (which may
>> result in unnecessary copying of data when the underlying hardware
>> can perform unaligned accesses efficiently), make those functions
>> deal with unaligned input explicitly, but only if the Kconfig symbol
>> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
>> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
>>
>> For crypto_inc(), this simply involves making the 4-byte stride
>> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
>> it typically operates on 16 byte buffers.
>>
>> For crypto_xor(), an algorithm is implemented that simply runs through
>> the input using the largest strides possible if unaligned accesses are
>> allowed. If they are not, an optimal sequence of memory accesses is
>> emitted that takes the relative alignment of the input buffers into
>> account, e.g., if the relative misalignment of dst and src is 4 bytes,
>> the entire xor operation will be completed using 4 byte loads and stores
>> (modulo unaligned bits at the start and end). Note that all expressions
>> involving startalign and misalign are simply eliminated by the compiler
>> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
>>
>
> Hi Ard,
>
> This is a good idea, and I think it was error-prone to be requiring
> 4-byte alignment always, and also inefficient on many architectures.
>
> The new crypto_inc() looks fine, but the new crypto_xor() is quite
> complicated. I'm wondering whether it has to be that way, especially
> since it seems to most commonly be used on very small input buffers,
> e.g. 8 or 16-byte blocks. There are a couple trivial ways it could be
> simplified, e.g. using 'dst' and 'src' directly instead of 'a' and 'b'
> (which also seems to improve code generation by getting rid of the
> '+= len & ~mask' parts), or using sizeof(long) directly instead of
> 'size' and 'mask'.
>
> But also when I tried testing the proposed crypto_xor() on MIPS, it
> didn't work correctly on a misaligned buffer. With startalign=1, it
> did one iteration of the following loop and then exited with
> startalign=0 and entered the "unsigned long at a time" loop, which is
> incorrect since at that point the buffers were not yet fully aligned:
>

Right. I knew it was convoluted, but I thought that was justified by
its correctness :-)

>> do {
>>         if (len < sizeof(u8))
>>                 break;
>>
>>         if (len >= size && !(startalign & 1) && !(misalign & 1))
>>                 break;
>>
>>         *dst++ ^= *src++;
>>         len -= sizeof(u8);
>>         startalign &= ~sizeof(u8);
>> } while (misalign & 1);
>
> I think it would need to do instead:
>
>         startalign += sizeof(u8);
>         startalign %= sizeof(unsigned long);
>
> But I am wondering whether you considered something simpler, using the
> get_unaligned/put_unaligned helpers, maybe even using a switch
> statement for the last (sizeof(long) - 1) bytes so it can be compiled
> as a jump table.  Something like this:
>
> #define xor_unaligned(dst, src) \
>         put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))
>
> void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
> {
>         while (len >= sizeof(unsigned long)) {
>                 xor_unaligned((unsigned long *)dst, (unsigned long *)src);
>                 dst += sizeof(unsigned long);
>                 src += sizeof(unsigned long);
>                 len -= sizeof(unsigned long);
>         }
>
>         switch (len) {
> #ifdef CONFIG_64BIT
>         case 7:
>                 dst[6] ^= src[6];
>                 /* fall through */
>         case 6:
>                 xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
>                 goto len_4;
>         case 5:
>                 dst[4] ^= src[4];
>                 /* fall through */
>         case 4:
>         len_4:
>                 xor_unaligned((u32 *)dst, (u32 *)src);
>                 break;
> #endif
>         case 3:
>                 dst[2] ^= src[2];
>                 /* fall through */
>         case 2:
>                 xor_unaligned((u16 *)dst, (u16 *)src);
>                 break;
>         case 1:
>                 dst[0] ^= src[0];
>                 break;
>         }
> }
>
> That would seem like a better choice for small buffers, which seems to
> be the more common case. It should generate slightly faster code on
> architectures with fast unaligned access like x86_64, while still being
> sufficient on architectures without --- perhaps even faster, since it
> wouldn't have as many branches.
>

Well, what I tried to deal with explicitly is
Re: [RFC PATCH] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic
On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
> Instead of unconditionally forcing 4 byte alignment for all generic
> chaining modes that rely on crypto_xor() or crypto_inc() (which may
> result in unnecessary copying of data when the underlying hardware
> can perform unaligned accesses efficiently), make those functions
> deal with unaligned input explicitly, but only if the Kconfig symbol
> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
>
> For crypto_inc(), this simply involves making the 4-byte stride
> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
> it typically operates on 16 byte buffers.
>
> For crypto_xor(), an algorithm is implemented that simply runs through
> the input using the largest strides possible if unaligned accesses are
> allowed. If they are not, an optimal sequence of memory accesses is
> emitted that takes the relative alignment of the input buffers into
> account, e.g., if the relative misalignment of dst and src is 4 bytes,
> the entire xor operation will be completed using 4 byte loads and stores
> (modulo unaligned bits at the start and end). Note that all expressions
> involving startalign and misalign are simply eliminated by the compiler
> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
>

Hi Ard,

This is a good idea, and I think it was error-prone to be requiring
4-byte alignment always, and also inefficient on many architectures.

The new crypto_inc() looks fine, but the new crypto_xor() is quite
complicated. I'm wondering whether it has to be that way, especially
since it seems to most commonly be used on very small input buffers,
e.g. 8 or 16-byte blocks. There are a couple trivial ways it could be
simplified, e.g. using 'dst' and 'src' directly instead of 'a' and 'b'
(which also seems to improve code generation by getting rid of the
'+= len & ~mask' parts), or using sizeof(long) directly instead of
'size' and 'mask'.

But also when I tried testing the proposed crypto_xor() on MIPS, it
didn't work correctly on a misaligned buffer. With startalign=1, it
did one iteration of the following loop and then exited with
startalign=0 and entered the "unsigned long at a time" loop, which is
incorrect since at that point the buffers were not yet fully aligned:

> do {
>         if (len < sizeof(u8))
>                 break;
>
>         if (len >= size && !(startalign & 1) && !(misalign & 1))
>                 break;
>
>         *dst++ ^= *src++;
>         len -= sizeof(u8);
>         startalign &= ~sizeof(u8);
> } while (misalign & 1);

I think it would need to do instead:

        startalign += sizeof(u8);
        startalign %= sizeof(unsigned long);

But I am wondering whether you considered something simpler, using the
get_unaligned/put_unaligned helpers, maybe even using a switch
statement for the last (sizeof(long) - 1) bytes so it can be compiled
as a jump table.  Something like this:

#define xor_unaligned(dst, src) \
        put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))

void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
{
        while (len >= sizeof(unsigned long)) {
                xor_unaligned((unsigned long *)dst, (unsigned long *)src);
                dst += sizeof(unsigned long);
                src += sizeof(unsigned long);
                len -= sizeof(unsigned long);
        }

        switch (len) {
#ifdef CONFIG_64BIT
        case 7:
                dst[6] ^= src[6];
                /* fall through */
        case 6:
                xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
                goto len_4;
        case 5:
                dst[4] ^= src[4];
                /* fall through */
        case 4:
        len_4:
                xor_unaligned((u32 *)dst, (u32 *)src);
                break;
#endif
        case 3:
                dst[2] ^= src[2];
                /* fall through */
        case 2:
                xor_unaligned((u16 *)dst, (u16 *)src);
                break;
        case 1:
                dst[0] ^= src[0];
                break;
        }
}

That would seem like a better choice for small buffers, which seems to
be the more common case. It should generate slightly faster code on
architectures with fast unaligned access like x86_64, while still being
sufficient on architectures without --- perhaps even faster, since it
wouldn't have as many branches.

Eric
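(For what it's worth, misaligned-buffer bugs like the MIPS one above
are easy to catch by brute-forcing every small offset/length
combination against the byte-at-a-time reference. A throwaway
userspace harness along these lines -- all names invented, crypto_xor()
linked in as the candidate under test:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint8_t u8;

void crypto_xor(u8 *dst, const u8 *src, unsigned int len); /* under test */

int main(void)
{
	u8 a[32], b[32], ref[32];
	unsigned int off_a, off_b, len, i;

	for (off_a = 0; off_a < 8; off_a++)
	for (off_b = 0; off_b < 8; off_b++)
	for (len = 0; len <= 16; len++) {
		for (i = 0; i < 32; i++) {
			a[i] = i * 7 + 1;
			b[i] = i * 13 + 5;
		}
		/* byte-at-a-time reference result */
		memcpy(ref, a, 32);
		for (i = 0; i < len; i++)
			ref[off_a + i] ^= b[off_b + i];

		crypto_xor(a + off_a, b + off_b, len);

		/* comparing the whole array also catches stray writes */
		if (memcmp(a, ref, 32)) {
			printf("FAIL off_a=%u off_b=%u len=%u\n",
			       off_a, off_b, len);
			return 1;
		}
	}
	printf("all alignment/length combinations OK\n");
	return 0;
}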