On 8 October 2014 00:48, John-Mark Gurney <j...@funkthat.com> wrote: > Christian Weisgerber wrote this message on Tue, Oct 07, 2014 at 23:08 +0200: >> John-Mark Gurney: >> >> > So, as I was working on FreeBSD's implementation of gmac.c, I noticed >> > that I was able to get a significant speed up by using a mask instead >> > of an if branch in ghash_gfmul in gmac.c from OpenBSD... >> > >> > Add a mask var and replace the code between the comments >> > "update Z" and "update V" w/: >> > mask = !!(x[i >> 3] & (1 << (~i & 7))); >> > mask = ~(mask - 1); >> > >> > z[0] ^= v[0] & mask; >> > z[1] ^= v[1] & mask; >> > z[2] ^= v[2] & mask; >> > z[3] ^= v[3] & mask; >> > >> > And you should see a nice performance increase... >> >> I tried this on a Soekris net6501-50 and the performance increase >> was around 1.3%. (I set up an ESP transport association with >> AES-128-GMAC and pushed UDP traffic with tcpbench over it.) > > Yeh, I benchmarked the raw algo in userland, not part of IPsec. I > forget the resulting perf increase, but it was well more than 10-20%. > >> A look at the generated amd64 assembly code shows that the change >> indeed removes a branch. What's pretty shocking is that this code >> >> mul = v[3] & 1; >> ... >> v[0] = (v[0] >> 1) ^ (0xe1000000 * mul); >> >> is turned into an actual imul instruction by GCC. I used the same >> masking approach to get rid of the multiplication, but the improvement >> was minuscule (<1%). > > Hmm. interesting... In my code I switched both to using the and > operator... > > I also have code which switches the registers to 64bits so that on > amd64, we make better uses of registers, and on i386, the compilers > are good at breaking down the 64bit registers to 32bit w/o extra > work... > >> > I also have an implementation of ghash that does a 4 bit lookup table >> > version with the table split between cache lines in p4 at: >> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4 >> >> I'll have to look at this, but haven't there been increasing >> misgivings about table implementations for GHASH because of timing >> attacks? > > Well, the code avoids one issue but introduces another issue... Each > table lookup accesses all four lines (assuming 64 byte cache lines), so > there isn't a problem there... The issue intrroduded is that the first > 64 bits of a cache line are faster to access than the remaining bits... > > For more info, look at the NSS bug: > https://bugzilla.mozilla.org/show_bug.cgi?id=868948 > > Though considering that the AES implementation that FreeBSD is still > using is the standard table based AES, there are also timing issues > there too... Yes, no excuse for opening up additional windows... > > I have thought about introducing an option of slow but secure and > fast but insecure against local attackers... Since we are already > on the fast but insecure against local attackers path, I figured I'll > take the extra performance... > > As with all things, there is a trade off between speed and security... > As most IPsec gateways don't have a local user trying to target the > key, this is less of an issue... And even if they did, it'd probably > be easier to use a local root exploit to get the key than try to mount > a timing attack to extract a key that might change in an hour or a day... > > -- > John-Mark Gurney Voice: +1 415 225 5579 > > "All that I will do, has been done, All that I have, has not." >
Hi, I've talked to Theo and it looks like we'll be importing your GF2 "multiplication library" as is. I think we should concentrate on making table version of gmac.c work better. Cheers, Mike