On 8 October 2014 00:48, John-Mark Gurney <j...@funkthat.com> wrote:
> Christian Weisgerber wrote this message on Tue, Oct 07, 2014 at 23:08 +0200:
>> John-Mark Gurney:
>>
>> > So, as I was working on FreeBSD's implementation of gmac.c, I noticed
>> > that I was able to get a significant speed up by using a mask instead
>> > of an if branch in ghash_gfmul in gmac.c from OpenBSD...
>> >
>> > Add a mask var and replace the code between the comments
>> > "update Z" and "update V" w/:
>> >                 mask = !!(x[i >> 3] & (1 << (~i & 7)));
>> >                 mask = ~(mask - 1);
>> >
>> >                 z[0] ^= v[0] & mask;
>> >                 z[1] ^= v[1] & mask;
>> >                 z[2] ^= v[2] & mask;
>> >                 z[3] ^= v[3] & mask;
>> >
>> > And you should see a nice performance increase...
>>
>> I tried this on a Soekris net6501-50 and the performance increase
>> was around 1.3%.  (I set up an ESP transport association with
>> AES-128-GMAC and pushed UDP traffic with tcpbench over it.)
>
> Yeh, I benchmarked the raw algo in userland, not part of IPsec.  I
> forget the resulting perf increase, but it was well more than 10-20%.
>
>> A look at the generated amd64 assembly code shows that the change
>> indeed removes a branch.  What's pretty shocking is that this code
>>
>>     mul = v[3] & 1;
>>     ...
>>     v[0] = (v[0] >> 1) ^ (0xe1000000 * mul);
>>
>> is turned into an actual imul instruction by GCC.  I used the same
>> masking approach to get rid of the multiplication, but the improvement
>> was minuscule (<1%).
>
> Hmm. interesting...  In my code I switched both to using the and
> operator...
>
> I also have code which switches the registers to 64bits so that on
> amd64, we make better uses of registers, and on i386, the compilers
> are good at breaking down the 64bit registers to 32bit w/o extra
> work...
>
>> > I also have an implementation of ghash that does a 4 bit lookup table
>> > version with the table split between cache lines in p4 at:
>> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
>>
>> I'll have to look at this, but haven't there been increasing
>> misgivings about table implementations for GHASH because of timing
>> attacks?
>
> Well, the code avoids one issue but introduces another issue...  Each
> table lookup accesses all four lines (assuming 64 byte cache lines), so
> there isn't a problem there...  The issue intrroduded is that the first
> 64 bits of a cache line are faster to access than the remaining bits...
>
> For more info, look at the NSS bug:
> https://bugzilla.mozilla.org/show_bug.cgi?id=868948
>
> Though considering that the AES implementation that FreeBSD is still
> using is the standard table based AES, there are also timing issues
> there too...  Yes, no excuse for opening up additional windows...
>
> I have thought about introducing an option of slow but secure and
> fast but insecure against local attackers...  Since we are already
> on the fast but insecure against local attackers path, I figured I'll
> take the extra performance...
>
> As with all things, there is a trade off between speed and security...
> As most IPsec gateways don't have a local user trying to target the
> key, this is less of an issue...  And even if they did, it'd probably
> be easier to use a local root exploit to get the key than try to mount
> a timing attack to extract a key that might change in an hour or a day...
>
> --
>   John-Mark Gurney                              Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."
>

Hi,

I've talked to Theo and it looks like we'll be importing your GF2
"multiplication library" as is.  I think we should concentrate on
making table version of gmac.c work better.

Cheers,
Mike

Reply via email to