On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
> Instead of unconditionally forcing 4 byte alignment for all generic
> chaining modes that rely on crypto_xor() or crypto_inc() (which may
> result in unnecessary copying of data when the underlying hardware
> can perform unaligned accesses efficiently), make those functions
> deal with unaligned input explicitly, but only if the Kconfig symbol
> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
> 
> For crypto_inc(), this simply involves making the 4-byte stride
> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
> it typically operates on 16 byte buffers.
> 
> For crypto_xor(), an algorithm is implemented that simply runs through
> the input using the largest strides possible if unaligned accesses are
> allowed. If they are not, an optimal sequence of memory accesses is
> emitted that takes the relative alignment of the input buffers into
> account, e.g., if the relative misalignment of dst and src is 4 bytes,
> the entire xor operation will be completed using 4 byte loads and stores
> (modulo unaligned bits at the start and end). Note that all expressions
> involving startalign and misalign are simply eliminated by the compiler
> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
> 

Hi Ard,

This is a good idea, and I think requiring 4-byte alignment unconditionally
was both error-prone and inefficient on many architectures.

The new crypto_inc() looks fine, but the new crypto_xor() is quite complicated.
I'm wondering whether it has to be that way, especially since it seems to be
used most commonly on very small input buffers, e.g. 8- or 16-byte blocks.
There are a couple of trivial ways it could be simplified, e.g. using 'dst' and
'src' directly instead of 'a' and 'b' (which also seems to improve code
generation by getting rid of the '+= len & ~mask' parts), or using sizeof(long)
directly instead of 'size' and 'mask'.
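
For example, the word-at-a-time part could just advance 'dst' and 'src'
themselves, roughly like this (untested, and assuming the byte-at-a-time
part has already dealt with any misalignment, or that unaligned accesses
are acceptable):

        while (len >= sizeof(unsigned long)) {
                *(unsigned long *)dst ^= *(unsigned long *)src;
                dst += sizeof(unsigned long);
                src += sizeof(unsigned long);
                len -= sizeof(unsigned long);
        }

Then there would be no need for separate 'a'/'b' pointers or the
'+= len & ~mask' adjustments.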

But also when I tried testing the proposed crypto_xor() on MIPS, it didn't work
correctly on a misaligned buffer.  With startalign=1, it did one iteration of
the following loop and then exited with startalign=0 and entered the "unsigned
long at a time" loop, which is incorrect since at that point the buffers were
not yet fully aligned:

>               do {
>                       if (len < sizeof(u8))
>                               break;
>
>                       if (len >= size && !(startalign & 1) && !(misalign & 1))
>                               break;
>
>                       *dst++ ^= *src++;
>                       len -= sizeof(u8);
>                       startalign &= ~sizeof(u8);
>               } while (misalign & 1);

I think it would instead need to keep updating 'startalign' until the pointers
are actually word-aligned, i.e. do:

                startalign += sizeof(u8);
                startalign %= sizeof(unsigned long);

But I am wondering whether you considered something simpler, using the
get_unaligned/put_unaligned helpers, maybe even with a switch statement for the
remaining (up to sizeof(long) - 1) bytes so it can be compiled as a jump table.
Something like this:

#define xor_unaligned(dst, src) \
        put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))

void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
{
        while (len >= sizeof(unsigned long)) {
                xor_unaligned((unsigned long *)dst, (unsigned long *)src);
                dst += sizeof(unsigned long);
                src += sizeof(unsigned long);
                len -= sizeof(unsigned long);
        }

        switch (len) {
#ifdef CONFIG_64BIT
        case 7:
                dst[6] ^= src[6];
                /* fall through */
        case 6:
                xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
                goto len_4;
        case 5:
                dst[4] ^= src[4];
                /* fall through */
        case 4:
        len_4:
                xor_unaligned((u32 *)dst, (u32 *)src);
                break;
#endif
        case 3:
                dst[2] ^= src[2];
                /* fall through */
        case 2:
                xor_unaligned((u16 *)dst, (u16 *)src);
                break;
        case 1:
                dst[0] ^= src[0];
                break;
        }
}

That would seem like a better choice for small buffers, which appear to be the
more common case.  It should generate slightly faster code on architectures
with fast unaligned access like x86_64, while still working correctly on
architectures without it; it may even be faster there as well, since it
wouldn't have as many branches.
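
For context, callers typically do one block-sized XOR per iteration, so 'len'
is nearly always 8 or 16.  A rough sketch of a CBC-style encrypt loop (not the
exact cbc.c code, just the shape of it):

        while (nbytes >= bsize) {               /* bsize == 16 for AES */
                crypto_xor(iv, src, bsize);     /* iv ^= plaintext block */
                crypto_cipher_encrypt_one(tfm, dst, iv);
                iv = dst;                       /* ciphertext chains into the next block */
                src += bsize;
                dst += bsize;
                nbytes -= bsize;
        }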

Eric
