Re: [PATCH v3] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

2017-02-14 Thread Ard Biesheuvel
On 13 February 2017 at 21:55, Jason A. Donenfeld  wrote:
> On Sun, Feb 5, 2017 at 11:06 AM, Ard Biesheuvel
>  wrote:
>> +   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
>> +   !((unsigned long)b & (__alignof__(*b) - 1)))
>
> Why not simply use the IS_ALIGNED macro?
>

Good point.

> Also, are you might consider checking to see if this is a constant, so
> that you can avoid an unnecessary branch.

Check if what is a constant? The buffer address?

> Alternatively, if you want
> to use the branch, I'd be interested in you writing back saying, "I
> tested both cases, and branching is faster than always using the slow
> unaligned path."

When using the 4-byte wide path, the loop terminates after 1 iteration
unless there is a carry in the low word. The likelihood of the loop
iterating multiple times in the 1-byte wide path is 1 in 256.

I suppose we could add this if there is concern about the branching

diff --git a/crypto/algapi.c b/crypto/algapi.c
index 6b52e8f0b95f..03670390a2e4 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -967,7 +967,7 @@ void crypto_inc(u8 *a, unsigned int size)
for (; size >= 4; size -= 4) {
c = be32_to_cpu(*--b) + 1;
*b = cpu_to_be32(c);
-   if (c)
+   if (likely(c))
return;
}

but other than that, I see little reason to introduce complicated logic here.


>> +   while (((unsigned long)dst & (relalign - 1)) && len > 0) {
>
> IS_ALIGNED
>
>> +static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
>> +{
>> +   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
>> +   __builtin_constant_p(size) &&
>> +   (size % sizeof(unsigned long)) == 0) {
>
> You can expand this condition to be:
>
> if ( (is_constant(size) && size%sizeof(ulong)==0) &&
> (efficient_unaligned || (is_constant(dst) && is_constant(src) &&
> is_aligned(dst) && is_aligned(src))) )
>
> It might seem complex, but it all gets compiled out.

Care to explain how? Could you point me to any references to
crypto_xor() where either of the buffer addresses are compile time
(not link time) constants?


Re: [PATCH v3] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

2017-02-13 Thread Jason A. Donenfeld
On Sun, Feb 5, 2017 at 11:06 AM, Ard Biesheuvel
 wrote:
> +   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
> +   !((unsigned long)b & (__alignof__(*b) - 1)))

Why not simply use the IS_ALIGNED macro?

Also, are you might consider checking to see if this is a constant, so
that you can avoid an unnecessary branch. Alternatively, if you want
to use the branch, I'd be interested in you writing back saying, "I
tested both cases, and branching is faster than always using the slow
unaligned path."
> +   while (((unsigned long)dst & (relalign - 1)) && len > 0) {

IS_ALIGNED

> +static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
> +{
> +   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> +   __builtin_constant_p(size) &&
> +   (size % sizeof(unsigned long)) == 0) {

You can expand this condition to be:

if ( (is_constant(size) && size%sizeof(ulong)==0) &&
(efficient_unaligned || (is_constant(dst) && is_constant(src) &&
is_aligned(dst) && is_aligned(src))) )

It might seem complex, but it all gets compiled out.


Re: [PATCH v3] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

2017-02-11 Thread Herbert Xu
On Sun, Feb 05, 2017 at 10:06:12AM +, Ard Biesheuvel wrote:
> Instead of unconditionally forcing 4 byte alignment for all generic
> chaining modes that rely on crypto_xor() or crypto_inc() (which may
> result in unnecessary copying of data when the underlying hardware
> can perform unaligned accesses efficiently), make those functions
> deal with unaligned input explicitly, but only if the Kconfig symbol
> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
> 
> For crypto_inc(), this simply involves making the 4-byte stride
> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
> it typically operates on 16 byte buffers.
> 
> For crypto_xor(), an algorithm is implemented that simply runs through
> the input using the largest strides possible if unaligned accesses are
> allowed. If they are not, an optimal sequence of memory accesses is
> emitted that takes the relative alignment of the input buffers into
> account, e.g., if the relative misalignment of dst and src is 4 bytes,
> the entire xor operation will be completed using 4 byte loads and stores
> (modulo unaligned bits at the start and end). Note that all expressions
> involving misalign are simply eliminated by the compiler when
> HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
> 
> Signed-off-by: Ard Biesheuvel 

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v3] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

2017-02-05 Thread Ard Biesheuvel
Instead of unconditionally forcing 4 byte alignment for all generic
chaining modes that rely on crypto_xor() or crypto_inc() (which may
result in unnecessary copying of data when the underlying hardware
can perform unaligned accesses efficiently), make those functions
deal with unaligned input explicitly, but only if the Kconfig symbol
HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.

For crypto_inc(), this simply involves making the 4-byte stride
conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
it typically operates on 16 byte buffers.

For crypto_xor(), an algorithm is implemented that simply runs through
the input using the largest strides possible if unaligned accesses are
allowed. If they are not, an optimal sequence of memory accesses is
emitted that takes the relative alignment of the input buffers into
account, e.g., if the relative misalignment of dst and src is 4 bytes,
the entire xor operation will be completed using 4 byte loads and stores
(modulo unaligned bits at the start and end). Note that all expressions
involving misalign are simply eliminated by the compiler when
HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.

Signed-off-by: Ard Biesheuvel 
---
v3: fix thinko in processing of unaligned leading chunk
inline common case where the input size is a constant multiple of the word
size on architectures with h/w handling of unaligned accesses

 crypto/algapi.c | 68 ++--
 crypto/cbc.c|  3 -
 crypto/cmac.c   |  3 +-
 crypto/ctr.c|  2 +-
 crypto/cts.c|  3 -
 crypto/pcbc.c   |  3 -
 crypto/seqiv.c  |  2 -
 include/crypto/algapi.h | 20 +-
 8 files changed, 70 insertions(+), 34 deletions(-)

diff --git a/crypto/algapi.c b/crypto/algapi.c
index 1fad2a6b3bbb..6b52e8f0b95f 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -962,34 +962,66 @@ void crypto_inc(u8 *a, unsigned int size)
__be32 *b = (__be32 *)(a + size);
u32 c;
 
-   for (; size >= 4; size -= 4) {
-   c = be32_to_cpu(*--b) + 1;
-   *b = cpu_to_be32(c);
-   if (c)
-   return;
-   }
+   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+   !((unsigned long)b & (__alignof__(*b) - 1)))
+   for (; size >= 4; size -= 4) {
+   c = be32_to_cpu(*--b) + 1;
+   *b = cpu_to_be32(c);
+   if (c)
+   return;
+   }
 
crypto_inc_byte(a, size);
 }
 EXPORT_SYMBOL_GPL(crypto_inc);
 
-static inline void crypto_xor_byte(u8 *a, const u8 *b, unsigned int size)
+void __crypto_xor(u8 *dst, const u8 *src, unsigned int len)
 {
-   for (; size; size--)
-   *a++ ^= *b++;
-}
+   int relalign = 0;
+
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+   int size = sizeof(unsigned long);
+   int d = ((unsigned long)dst ^ (unsigned long)src) & (size - 1);
+
+   relalign = d ? 1 << __ffs(d) : size;
+
+   /*
+* If we care about alignment, process as many bytes as
+* needed to advance dst and src to values whose alignments
+* equal their relative alignment. This will allow us to
+* process the remainder of the input using optimal strides.
+*/
+   while (((unsigned long)dst & (relalign - 1)) && len > 0) {
+   *dst++ ^= *src++;
+   len--;
+   }
+   }
 
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
-{
-   u32 *a = (u32 *)dst;
-   u32 *b = (u32 *)src;
+   while (IS_ENABLED(CONFIG_64BIT) && len >= 8 && !(relalign & 7)) {
+   *(u64 *)dst ^= *(u64 *)src;
+   dst += 8;
+   src += 8;
+   len -= 8;
+   }
 
-   for (; size >= 4; size -= 4)
-   *a++ ^= *b++;
+   while (len >= 4 && !(relalign & 3)) {
+   *(u32 *)dst ^= *(u32 *)src;
+   dst += 4;
+   src += 4;
+   len -= 4;
+   }
+
+   while (len >= 2 && !(relalign & 1)) {
+   *(u16 *)dst ^= *(u16 *)src;
+   dst += 2;
+   src += 2;
+   len -= 2;
+   }
 
-   crypto_xor_byte((u8 *)a, (u8 *)b, size);
+   while (len--)
+   *dst++ ^= *src++;
 }
-EXPORT_SYMBOL_GPL(crypto_xor);
+EXPORT_SYMBOL_GPL(__crypto_xor);
 
 unsigned int crypto_alg_extsize(struct crypto_alg *alg)
 {
diff --git a/crypto/cbc.c b/crypto/cbc.c
index 68f751a41a84..bc160a3186dc 100644
--- a/crypto/cbc.c
+++ b/crypto/cbc.c
@@ -145,9 +145,6 @@ static int crypto_cbc_create(struct crypto_template *tmpl, 
struct rtattr **tb)
inst->alg.base.cra_blocksize = alg->cra_blocksize;