improving OpenBSD's gmac.c...

2014-09-30 Thread John-Mark Gurney
So, as I was working on FreeBSD's implementation of gmac.c, I noticed
that I was able to get a significant speed up by using a mask instead
of an if branch in ghash_gfmul in gmac.c from OpenBSD...

Add a mask var and replace the code between the comments
"update Z" and "update V" w/:
mask = !!(x[i >> 3] & (1 << (~i & 7)));
mask = ~(mask - 1);

z[0] ^= v[0] & mask;
z[1] ^= v[1] & mask;
z[2] ^= v[2] & mask;
z[3] ^= v[3] & mask;

And you should see a nice performance increase...

I also have an implementation of ghash that does a 4 bit lookup table
version with the table split between cache lines in p4 at:
https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4

This also has a version with does 4 blocks at a time getting a
further speed up...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."



Re: improving OpenBSD's gmac.c...

2014-10-07 Thread Christian Weisgerber
John-Mark Gurney:

> So, as I was working on FreeBSD's implementation of gmac.c, I noticed
> that I was able to get a significant speed up by using a mask instead
> of an if branch in ghash_gfmul in gmac.c from OpenBSD...
> 
> Add a mask var and replace the code between the comments
> "update Z" and "update V" w/:
> mask = !!(x[i >> 3] & (1 << (~i & 7)));
> mask = ~(mask - 1);
> 
> z[0] ^= v[0] & mask;
> z[1] ^= v[1] & mask;
> z[2] ^= v[2] & mask;
> z[3] ^= v[3] & mask;
> 
> And you should see a nice performance increase...

I tried this on a Soekris net6501-50 and the performance increase
was around 1.3%.  (I set up an ESP transport association with
AES-128-GMAC and pushed UDP traffic with tcpbench over it.)

A look at the generated amd64 assembly code shows that the change
indeed removes a branch.  What's pretty shocking is that this code

mul = v[3] & 1;
...
v[0] = (v[0] >> 1) ^ (0xe100 * mul);

is turned into an actual imul instruction by GCC.  I used the same
masking approach to get rid of the multiplication, but the improvement
was minuscule (<1%).

> I also have an implementation of ghash that does a 4 bit lookup table
> version with the table split between cache lines in p4 at:
> https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4

I'll have to look at this, but haven't there been increasing
misgivings about table implementations for GHASH because of timing
attacks?

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: improving OpenBSD's gmac.c...

2014-10-07 Thread John-Mark Gurney
Christian Weisgerber wrote this message on Tue, Oct 07, 2014 at 23:08 +0200:
> John-Mark Gurney:
> 
> > So, as I was working on FreeBSD's implementation of gmac.c, I noticed
> > that I was able to get a significant speed up by using a mask instead
> > of an if branch in ghash_gfmul in gmac.c from OpenBSD...
> > 
> > Add a mask var and replace the code between the comments
> > "update Z" and "update V" w/:
> > mask = !!(x[i >> 3] & (1 << (~i & 7)));
> > mask = ~(mask - 1);
> > 
> > z[0] ^= v[0] & mask;
> > z[1] ^= v[1] & mask;
> > z[2] ^= v[2] & mask;
> > z[3] ^= v[3] & mask;
> > 
> > And you should see a nice performance increase...
> 
> I tried this on a Soekris net6501-50 and the performance increase
> was around 1.3%.  (I set up an ESP transport association with
> AES-128-GMAC and pushed UDP traffic with tcpbench over it.)

Yeh, I benchmarked the raw algo in userland, not part of IPsec.  I
forget the resulting perf increase, but it was well more than 10-20%.

> A look at the generated amd64 assembly code shows that the change
> indeed removes a branch.  What's pretty shocking is that this code
> 
> mul = v[3] & 1;
> ...
> v[0] = (v[0] >> 1) ^ (0xe100 * mul);
> 
> is turned into an actual imul instruction by GCC.  I used the same
> masking approach to get rid of the multiplication, but the improvement
> was minuscule (<1%).

Hmm. interesting...  In my code I switched both to using the and
operator...

I also have code which switches the registers to 64bits so that on
amd64, we make better uses of registers, and on i386, the compilers
are good at breaking down the 64bit registers to 32bit w/o extra
work...

> > I also have an implementation of ghash that does a 4 bit lookup table
> > version with the table split between cache lines in p4 at:
> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> 
> I'll have to look at this, but haven't there been increasing
> misgivings about table implementations for GHASH because of timing
> attacks?

Well, the code avoids one issue but introduces another issue...  Each
table lookup accesses all four lines (assuming 64 byte cache lines), so
there isn't a problem there...  The issue intrroduded is that the first
64 bits of a cache line are faster to access than the remaining bits...

For more info, look at the NSS bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=868948

Though considering that the AES implementation that FreeBSD is still
using is the standard table based AES, there are also timing issues
there too...  Yes, no excuse for opening up additional windows...

I have thought about introducing an option of slow but secure and
fast but insecure against local attackers...  Since we are already
on the fast but insecure against local attackers path, I figured I'll
take the extra performance...

As with all things, there is a trade off between speed and security...
As most IPsec gateways don't have a local user trying to target the
key, this is less of an issue...  And even if they did, it'd probably
be easier to use a local root exploit to get the key than try to mount
a timing attack to extract a key that might change in an hour or a day...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."



Re: improving OpenBSD's gmac.c...

2014-10-08 Thread Mike Belopuhov
On 8 October 2014 00:48, John-Mark Gurney  wrote:
> Christian Weisgerber wrote this message on Tue, Oct 07, 2014 at 23:08 +0200:
>> John-Mark Gurney:
>>
>> > So, as I was working on FreeBSD's implementation of gmac.c, I noticed
>> > that I was able to get a significant speed up by using a mask instead
>> > of an if branch in ghash_gfmul in gmac.c from OpenBSD...
>> >
>> > Add a mask var and replace the code between the comments
>> > "update Z" and "update V" w/:
>> > mask = !!(x[i >> 3] & (1 << (~i & 7)));
>> > mask = ~(mask - 1);
>> >
>> > z[0] ^= v[0] & mask;
>> > z[1] ^= v[1] & mask;
>> > z[2] ^= v[2] & mask;
>> > z[3] ^= v[3] & mask;
>> >
>> > And you should see a nice performance increase...
>>
>> I tried this on a Soekris net6501-50 and the performance increase
>> was around 1.3%.  (I set up an ESP transport association with
>> AES-128-GMAC and pushed UDP traffic with tcpbench over it.)
>
> Yeh, I benchmarked the raw algo in userland, not part of IPsec.  I
> forget the resulting perf increase, but it was well more than 10-20%.
>
>> A look at the generated amd64 assembly code shows that the change
>> indeed removes a branch.  What's pretty shocking is that this code
>>
>> mul = v[3] & 1;
>> ...
>> v[0] = (v[0] >> 1) ^ (0xe100 * mul);
>>
>> is turned into an actual imul instruction by GCC.  I used the same
>> masking approach to get rid of the multiplication, but the improvement
>> was minuscule (<1%).
>
> Hmm. interesting...  In my code I switched both to using the and
> operator...
>
> I also have code which switches the registers to 64bits so that on
> amd64, we make better uses of registers, and on i386, the compilers
> are good at breaking down the 64bit registers to 32bit w/o extra
> work...
>
>> > I also have an implementation of ghash that does a 4 bit lookup table
>> > version with the table split between cache lines in p4 at:
>> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
>>
>> I'll have to look at this, but haven't there been increasing
>> misgivings about table implementations for GHASH because of timing
>> attacks?
>
> Well, the code avoids one issue but introduces another issue...  Each
> table lookup accesses all four lines (assuming 64 byte cache lines), so
> there isn't a problem there...  The issue intrroduded is that the first
> 64 bits of a cache line are faster to access than the remaining bits...
>
> For more info, look at the NSS bug:
> https://bugzilla.mozilla.org/show_bug.cgi?id=868948
>
> Though considering that the AES implementation that FreeBSD is still
> using is the standard table based AES, there are also timing issues
> there too...  Yes, no excuse for opening up additional windows...
>
> I have thought about introducing an option of slow but secure and
> fast but insecure against local attackers...  Since we are already
> on the fast but insecure against local attackers path, I figured I'll
> take the extra performance...
>
> As with all things, there is a trade off between speed and security...
> As most IPsec gateways don't have a local user trying to target the
> key, this is less of an issue...  And even if they did, it'd probably
> be easier to use a local root exploit to get the key than try to mount
> a timing attack to extract a key that might change in an hour or a day...
>
> --
>   John-Mark Gurney  Voice: +1 415 225 5579
>
>  "All that I will do, has been done, All that I have, has not."
>

Hi,

I've talked to Theo and it looks like we'll be importing your GF2
"multiplication library" as is.  I think we should concentrate on
making table version of gmac.c work better.

Cheers,
Mike



Re: improving OpenBSD's gmac.c...

2014-10-08 Thread John-Mark Gurney
Mike Belopuhov wrote this message on Wed, Oct 08, 2014 at 14:32 +0200:
> On 8 October 2014 00:48, John-Mark Gurney  wrote:
> > Christian Weisgerber wrote this message on Tue, Oct 07, 2014 at 23:08 +0200:
> >> John-Mark Gurney:
> >>
> >> > So, as I was working on FreeBSD's implementation of gmac.c, I noticed
> >> > that I was able to get a significant speed up by using a mask instead
> >> > of an if branch in ghash_gfmul in gmac.c from OpenBSD...
> >> >
> >> > Add a mask var and replace the code between the comments
> >> > "update Z" and "update V" w/:
> >> > mask = !!(x[i >> 3] & (1 << (~i & 7)));
> >> > mask = ~(mask - 1);
> >> >
> >> > z[0] ^= v[0] & mask;
> >> > z[1] ^= v[1] & mask;
> >> > z[2] ^= v[2] & mask;
> >> > z[3] ^= v[3] & mask;
> >> >
> >> > And you should see a nice performance increase...
> >>
> >> I tried this on a Soekris net6501-50 and the performance increase
> >> was around 1.3%.  (I set up an ESP transport association with
> >> AES-128-GMAC and pushed UDP traffic with tcpbench over it.)
> >
> > Yeh, I benchmarked the raw algo in userland, not part of IPsec.  I
> > forget the resulting perf increase, but it was well more than 10-20%.
> >
> >> A look at the generated amd64 assembly code shows that the change
> >> indeed removes a branch.  What's pretty shocking is that this code
> >>
> >> mul = v[3] & 1;
> >> ...
> >> v[0] = (v[0] >> 1) ^ (0xe100 * mul);
> >>
> >> is turned into an actual imul instruction by GCC.  I used the same
> >> masking approach to get rid of the multiplication, but the improvement
> >> was minuscule (<1%).
> >
> > Hmm. interesting...  In my code I switched both to using the and
> > operator...
> >
> > I also have code which switches the registers to 64bits so that on
> > amd64, we make better uses of registers, and on i386, the compilers
> > are good at breaking down the 64bit registers to 32bit w/o extra
> > work...
> >
> >> > I also have an implementation of ghash that does a 4 bit lookup table
> >> > version with the table split between cache lines in p4 at:
> >> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> >>
> >> I'll have to look at this, but haven't there been increasing
> >> misgivings about table implementations for GHASH because of timing
> >> attacks?
> >
> > Well, the code avoids one issue but introduces another issue...  Each
> > table lookup accesses all four lines (assuming 64 byte cache lines), so
> > there isn't a problem there...  The issue intrroduded is that the first
> > 64 bits of a cache line are faster to access than the remaining bits...
> >
> > For more info, look at the NSS bug:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=868948
> >
> > Though considering that the AES implementation that FreeBSD is still
> > using is the standard table based AES, there are also timing issues
> > there too...  Yes, no excuse for opening up additional windows...
> >
> > I have thought about introducing an option of slow but secure and
> > fast but insecure against local attackers...  Since we are already
> > on the fast but insecure against local attackers path, I figured I'll
> > take the extra performance...
> >
> > As with all things, there is a trade off between speed and security...
> > As most IPsec gateways don't have a local user trying to target the
> > key, this is less of an issue...  And even if they did, it'd probably
> > be easier to use a local root exploit to get the key than try to mount
> > a timing attack to extract a key that might change in an hour or a day...
> 
> I've talked to Theo and it looks like we'll be importing your GF2
> "multiplication library" as is.  I think we should concentrate on
> making table version of gmac.c work better.

Sounds good... Let me know of any issues you have...  I'll want to try
to keep the source in sync...

As for making the table version work better? do you mean closer to
constant time? or?

Thanks.

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."



Re: improving OpenBSD's gmac.c...

2014-10-09 Thread Christian Weisgerber
John-Mark Gurney:

> I also have an implementation of ghash that does a 4 bit lookup table
> version with the table split between cache lines in p4 at:
> https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> 
> This also has a version with does 4 blocks at a time getting a
> further speed up...

FWIW, I did a quick & dirty merge of this into the OpenBSD tree and
the speed of my test (net6501-50, tcpbench -u over esp aes-128-gmac)
almost doubled.

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: improving OpenBSD's gmac.c...

2014-10-09 Thread Chris Cappuccio
Christian Weisgerber [na...@mips.inka.de] wrote:
> John-Mark Gurney:
> 
> > I also have an implementation of ghash that does a 4 bit lookup table
> > version with the table split between cache lines in p4 at:
> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> > 
> > This also has a version with does 4 blocks at a time getting a
> > further speed up...
> 
> FWIW, I did a quick & dirty merge of this into the OpenBSD tree and
> the speed of my test (net6501-50, tcpbench -u over esp aes-128-gmac)
> almost doubled.
> 

That makes me horny.



Re: improving OpenBSD's gmac.c...

2014-10-09 Thread Damien Miller
On Thu, 9 Oct 2014, Christian Weisgerber wrote:

> John-Mark Gurney:
> 
> > I also have an implementation of ghash that does a 4 bit lookup table
> > version with the table split between cache lines in p4 at:
> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> > 
> > This also has a version with does 4 blocks at a time getting a
> > further speed up...
> 
> FWIW, I did a quick & dirty merge of this into the OpenBSD tree and
> the speed of my test (net6501-50, tcpbench -u over esp aes-128-gmac)
> almost doubled.

isn't this likely to make it more likely to be subject to timing
attacks?

-d



Re: improving OpenBSD's gmac.c...

2014-10-12 Thread Christian Weisgerber
Here's a cleaned-up diff.  Briefly tested on amd64 & sparc64.  I'll
do some more testing tomorrow.  This already has mikeb@'s blessing.

Index: regress/sys/crypto/gmac/Makefile
===
RCS file: /cvs/src/regress/sys/crypto/gmac/Makefile,v
retrieving revision 1.2
diff -u -p -r1.2 Makefile
--- regress/sys/crypto/gmac/Makefile18 Jan 2014 05:54:52 -  1.2
+++ regress/sys/crypto/gmac/Makefile12 Oct 2014 19:05:35 -
@@ -3,7 +3,7 @@
 DIR=${.CURDIR}/../../../../sys
 
 PROG=  gmac_test
-SRCS+= rijndael.c gmac.c gmac_test.c
+SRCS+= rijndael.c gfmult.c gmac.c gmac_test.c
 CDIAGFLAGS=-Wall
 CDIAGFLAGS+=   -Werror
 CDIAGFLAGS+=   -Wpointer-arith
Index: sys/crypto/gfmult.c
===
RCS file: sys/crypto/gfmult.c
diff -N sys/crypto/gfmult.c
--- /dev/null   1 Jan 1970 00:00:00 -
+++ sys/crypto/gfmult.c 12 Oct 2014 17:28:42 -
@@ -0,0 +1,275 @@
+/*-
+ * Copyright (c) 2014 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed by John-Mark Gurney under
+ * the sponsorship of the FreeBSD Foundation and
+ * Rubicon Communications, LLC (Netgate).
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1.  Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2.  Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ *
+ */
+
+#include 
+
+#define REV_POLY_REDUCT0xe1/* 0x87 bit reversed */
+
+/* reverse the bits of a nibble */
+static const uint8_t nib_rev[] = {
+   0x0, 0x8, 0x4, 0xc, 0x2, 0xa, 0x6, 0xe,
+   0x1, 0x9, 0x5, 0xd, 0x3, 0xb, 0x7, 0xf,
+};
+
+/* calulate v * 2 */
+static inline struct gf128
+gf128_mulalpha(struct gf128 v)
+{
+   uint64_t mask;
+
+   mask = !!(v.v[1] & 1);
+   mask = ~(mask - 1);
+   v.v[1] = (v.v[1] >> 1) | ((v.v[0] & 1) << 63);
+   v.v[0] = (v.v[0] >> 1) ^ ((mask & REV_POLY_REDUCT) << 56);
+
+   return v;
+}
+
+/*
+ * Generate a table for 0-16 * h.  Store the results in the table w/ indexes
+ * bit reversed, and the words striped across the values.
+ */
+void
+gf128_genmultable(struct gf128 h, struct gf128table *t)
+{
+   struct gf128 tbl[16];
+   int i;
+
+   tbl[0] = MAKE_GF128(0, 0);
+   tbl[1] = h;
+
+   for (i = 2; i < 16; i += 2) {
+   tbl[i] = gf128_mulalpha(tbl[i / 2]);
+   tbl[i + 1] = gf128_add(tbl[i], h);
+   }
+
+   for (i = 0; i < 16; i++) {
+   t->a[nib_rev[i]] = tbl[i].v[0] >> 32;
+   t->b[nib_rev[i]] = tbl[i].v[0];
+   t->c[nib_rev[i]] = tbl[i].v[1] >> 32;
+   t->d[nib_rev[i]] = tbl[i].v[1];
+   }
+}
+
+/*
+ * Generate tables containing h, h^2, h^3 and h^4, starting at 0.
+ */
+void
+gf128_genmultable4(struct gf128 h, struct gf128table4 *t)
+{
+   struct gf128 h2, h3, h4;
+
+   gf128_genmultable(h, &t->tbls[0]);
+
+   h2 = gf128_mul(h, &t->tbls[0]);
+
+   gf128_genmultable(h2, &t->tbls[1]);
+
+   h3 = gf128_mul(h, &t->tbls[1]);
+   gf128_genmultable(h3, &t->tbls[2]);
+
+   h4 = gf128_mul(h2, &t->tbls[1]);
+   gf128_genmultable(h4, &t->tbls[3]);
+}
+
+/*
+ * Read a row from the table.
+ */
+static inline struct gf128
+readrow(struct gf128table *tbl, unsigned bits)
+{
+   struct gf128 r;
+
+   bits = bits % 16;
+
+   r.v[0] = ((uint64_t)tbl->a[bits] << 32) | tbl->b[bits];
+   r.v[1] = ((uint64_t)tbl->c[bits] << 32) | tbl->d[bits];
+
+   return r;
+}
+
+/*
+ * These are the reduction values.  Since we are dealing with bit reversed
+ * version, the values need to be bit reversed, AND the indexes are also
+ * bit reversed to make lookups quicker.
+ */
+static uint16_t reduction[] = {
+   0x, 0x1c20, 0x3840, 0x2460, 0x7080, 0x6ca0, 0x48c0, 0x54e0,
+   0xe100, 0xfd20, 0xd

Re: improving OpenBSD's gmac.c...

2014-10-12 Thread David Gwynne
dont you need endian.h to get bemtoh64 and htobem64?

On 13 Oct 2014, at 7:57, Christian Weisgerber  wrote:

> Here's a cleaned-up diff.  Briefly tested on amd64 & sparc64.  I'll
> do some more testing tomorrow.  This already has mikeb@'s blessing.
> 
> Index: regress/sys/crypto/gmac/Makefile
> ===
> RCS file: /cvs/src/regress/sys/crypto/gmac/Makefile,v
> retrieving revision 1.2
> diff -u -p -r1.2 Makefile
> --- regress/sys/crypto/gmac/Makefile  18 Jan 2014 05:54:52 -  1.2
> +++ regress/sys/crypto/gmac/Makefile  12 Oct 2014 19:05:35 -
> @@ -3,7 +3,7 @@
> DIR=${.CURDIR}/../../../../sys
> 
> PROG= gmac_test
> -SRCS+=   rijndael.c gmac.c gmac_test.c
> +SRCS+=   rijndael.c gfmult.c gmac.c gmac_test.c
> CDIAGFLAGS=   -Wall
> CDIAGFLAGS+=  -Werror
> CDIAGFLAGS+=  -Wpointer-arith
> Index: sys/crypto/gfmult.c
> ===
> RCS file: sys/crypto/gfmult.c
> diff -N sys/crypto/gfmult.c
> --- /dev/null 1 Jan 1970 00:00:00 -
> +++ sys/crypto/gfmult.c   12 Oct 2014 17:28:42 -
> @@ -0,0 +1,275 @@
> +/*-
> + * Copyright (c) 2014 The FreeBSD Foundation
> + * All rights reserved.
> + *
> + * This software was developed by John-Mark Gurney under
> + * the sponsorship of the FreeBSD Foundation and
> + * Rubicon Communications, LLC (Netgate).
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1.  Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * 2.  Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in the
> + * documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + *   $FreeBSD$
> + *
> + */
> +
> +#include 
> +
> +#define REV_POLY_REDUCT  0xe1/* 0x87 bit reversed */
> +
> +/* reverse the bits of a nibble */
> +static const uint8_t nib_rev[] = {
> + 0x0, 0x8, 0x4, 0xc, 0x2, 0xa, 0x6, 0xe,
> + 0x1, 0x9, 0x5, 0xd, 0x3, 0xb, 0x7, 0xf,
> +};
> +
> +/* calulate v * 2 */
> +static inline struct gf128
> +gf128_mulalpha(struct gf128 v)
> +{
> + uint64_t mask;
> +
> + mask = !!(v.v[1] & 1);
> + mask = ~(mask - 1);
> + v.v[1] = (v.v[1] >> 1) | ((v.v[0] & 1) << 63);
> + v.v[0] = (v.v[0] >> 1) ^ ((mask & REV_POLY_REDUCT) << 56);
> +
> + return v;
> +}
> +
> +/*
> + * Generate a table for 0-16 * h.  Store the results in the table w/ indexes
> + * bit reversed, and the words striped across the values.
> + */
> +void
> +gf128_genmultable(struct gf128 h, struct gf128table *t)
> +{
> + struct gf128 tbl[16];
> + int i;
> +
> + tbl[0] = MAKE_GF128(0, 0);
> + tbl[1] = h;
> +
> + for (i = 2; i < 16; i += 2) {
> + tbl[i] = gf128_mulalpha(tbl[i / 2]);
> + tbl[i + 1] = gf128_add(tbl[i], h);
> + }
> +
> + for (i = 0; i < 16; i++) {
> + t->a[nib_rev[i]] = tbl[i].v[0] >> 32;
> + t->b[nib_rev[i]] = tbl[i].v[0];
> + t->c[nib_rev[i]] = tbl[i].v[1] >> 32;
> + t->d[nib_rev[i]] = tbl[i].v[1];
> + }
> +}
> +
> +/*
> + * Generate tables containing h, h^2, h^3 and h^4, starting at 0.
> + */
> +void
> +gf128_genmultable4(struct gf128 h, struct gf128table4 *t)
> +{
> + struct gf128 h2, h3, h4;
> +
> + gf128_genmultable(h, &t->tbls[0]);
> +
> + h2 = gf128_mul(h, &t->tbls[0]);
> +
> + gf128_genmultable(h2, &t->tbls[1]);
> +
> + h3 = gf128_mul(h, &t->tbls[1]);
> + gf128_genmultable(h3, &t->tbls[2]);
> +
> + h4 = gf128_mul(h2, &t->tbls[1]);
> + gf128_genmultable(h4, &t->tbls[3]);
> +}
> +
> +/*
> + * Read a row from the table.
> + */
> +static inline struct gf128
> +readrow(struct gf128table *tbl, unsigned bits)
> +{
> + struct gf128 r;
> +
> + bits = bits % 16;
> +
> + r.v[0] = ((uint64_t)tbl->a[bits] << 32) | tbl->b[bits];
> + r.v[1] = ((uint64_t)tbl->c[bits] << 32) | tbl->d[bits];
> +
> + return r;
> +}
> +
> +/*

Re: improving OpenBSD's gmac.c...

2014-10-13 Thread Christian Weisgerber
David Gwynne:

> dont you need endian.h to get bemtoh64 and htobem64?

I went by the man page, which says .


PS: I accidentally omitted a chunk:

Index: sys/conf/files
===
RCS file: /cvs/src/sys/conf/files,v
retrieving revision 1.581
diff -u -p -r1.581 files
--- sys/conf/files  9 Oct 2014 08:21:48 -   1.581
+++ sys/conf/files  12 Oct 2014 17:29:33 -
@@ -870,6 +870,7 @@ file crypto/arc4.c  
 file crypto/michael.c  wlan
 file crypto/cmac.c wlan
 file crypto/hmac.c wlan | (softraid & crypto)
+file crypto/gfmult.c   (inet & ipsec) | crypto
 file crypto/gmac.c (inet & ipsec) | crypto
 file crypto/key_wrap.c wlan
 file crypto/idgen.cinet6 | nfsclient | nfsserver
-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: improving OpenBSD's gmac.c...

2014-10-13 Thread David Gwynne

On 13 Oct 2014, at 9:00 pm, Christian Weisgerber  wrote:

> David Gwynne:
> 
>> dont you need endian.h to get bemtoh64 and htobem64?
> 
> I went by the man page, which says .

and cvs blame says that's my fault... fair enough.

ill talk to philip about whether that should change, but dont let that hold the 
process up on this code.

dlg




Re: improving OpenBSD's gmac.c...

2014-10-13 Thread Philip Guenther
On Mon, Oct 13, 2014 at 5:26 AM, David Gwynne  wrote:
> On 13 Oct 2014, at 9:00 pm, Christian Weisgerber  wrote:
>
>> David Gwynne:
>>
>>> dont you need endian.h to get bemtoh64 and htobem64?

(This is kernel code, so that would be )


>> I went by the man page, which says .
>
> and cvs blame says that's my fault... fair enough.
>
> ill talk to philip about whether that should change, but dont let that hold 
> the process up on this code.

Well.  Right now  pulls in  (or at least
); someone with lots of patience and many archs to test
on could try removing that (and making  pull in
 directly!) and see how many .c files would need to
change and, more importantly, how badly ports break.

Until then, I see little benefit for kernel code to pull in
; you'll get it indirectly after pulling in
 or , so it's just noise and wasted compiler
cycles.  The manpages for the kernel-only interfaces (bemtoh*, etc)
should presumably follow that: recommending sys/types.h until someone
removes the indirect #include.


Philip Guenther



Re: improving OpenBSD's gmac.c...

2014-11-12 Thread Mike Belopuhov
On 10 October 2014 02:39, Damien Miller  wrote:
> On Thu, 9 Oct 2014, Christian Weisgerber wrote:
>
>> John-Mark Gurney:
>>
>> > I also have an implementation of ghash that does a 4 bit lookup table
>> > version with the table split between cache lines in p4 at:
>> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
>> >
>> > This also has a version with does 4 blocks at a time getting a
>> > further speed up...
>>
>> FWIW, I did a quick & dirty merge of this into the OpenBSD tree and
>> the speed of my test (net6501-50, tcpbench -u over esp aes-128-gmac)
>> almost doubled.
>
> isn't this likely to make it more likely to be subject to timing
> attacks?
>

then how is this different to our table based aes implementation?
and it's the same C code as in openssl which also uses table based
gcm implementation.

what countermeasures can be applied to the table lookup code
to fight these attacks?



Re: improving OpenBSD's gmac.c...

2014-11-12 Thread John-Mark Gurney
Mike Belopuhov wrote this message on Wed, Nov 12, 2014 at 19:05 +0100:
> On 10 October 2014 02:39, Damien Miller  wrote:
> > On Thu, 9 Oct 2014, Christian Weisgerber wrote:
> >
> >> John-Mark Gurney:
> >>
> >> > I also have an implementation of ghash that does a 4 bit lookup table
> >> > version with the table split between cache lines in p4 at:
> >> > https://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/opencrypto/sys/opencrypto/gfmult.c&REV=4
> >> >
> >> > This also has a version with does 4 blocks at a time getting a
> >> > further speed up...
> >>
> >> FWIW, I did a quick & dirty merge of this into the OpenBSD tree and
> >> the speed of my test (net6501-50, tcpbench -u over esp aes-128-gmac)
> >> almost doubled.
> >
> > isn't this likely to make it more likely to be subject to timing
> > attacks?
> 
> then how is this different to our table based aes implementation?

My gfmul code spreads the table out over 4 64-byte arrays (common
cache line size), and to read an entry, it must access all four...
This doesn't mitigate entirely the cache timing attacks due to the
fact that the first 64-bits of a cache line are faster:
https://bugzilla.mozilla.org/show_bug.cgi?id=868948#c5

> and it's the same C code as in openssl which also uses table based
> gcm implementation.
> 
> what countermeasures can be applied to the table lookup code
> to fight these attacks?

If you mean for AES, there is a version of AES that uses SSE instructions
to bit slice the AES S-box and caclulate the S-box as if it were a
set of logic gates...  See: Faster and Timing-Attack Resistant AES-GCM
by Emilia Kasper and Peter Schwabe...  But obviously that requires FPU
context saving...

One of the issues today is that most of the research for implementations
of algorithms assume you have access to SSE operations, there isn't much
research going into making safe implementations that aren't using SSE...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."



Re: improving OpenBSD's gmac.c...

2014-11-12 Thread Damien Miller
On Wed, 12 Nov 2014, Mike Belopuhov wrote:

> > isn't this likely to make it more likely to be subject to timing
> > attacks?
> >
> 
> then how is this different to our table based aes implementation?
> and it's the same C code as in openssl which also uses table based
> gcm implementation.

Yeah, that's crappy too - IMO we should definitely look at replacing it,
but I haven't found an implementation that is a) native C, b) doesn't
use FP/SSE tricksies and c) is acceptably fast.

OpenSSL only falls back to the table-based code if none of the assembler
versions are selected for the platform. AFAIK many of these are constant
time.

> what countermeasures can be applied to the table lookup code
> to fight these attacks?

It's possible to do dummy accesses, but these do slow things down and
many have been broken anyway.

-d