Re: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-17 Thread Alexey Dobriyan
On 1/16/12, Eric Dumazet eric.duma...@gmail.com wrote:
 Le lundi 16 janvier 2012 à 09:56 +, David Laight a écrit :
 Doesn't this badly overflow W[] ..

  +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
  +  t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];  \
 ...
  +  for (i = 0; i  16; i += 8) {
 ...
  +  SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
  +  }

  David



 No overflow since loop is done for only i=0 and i=8

 By the way, I suspect previous code was chosen years ago because this
 version uses less stack but adds much more code bloat.

I think W[80] was use because it's the most straightforward way to
write this code
by following spec.

All SHA definitions have full message schedule pseudocoded
before hash computation.

 size crypto/sha512_generic.o crypto/sha512_generic_old.o
text  data bss dec hex filename
   17369   704   0   180734699 crypto/sha512_generic.o
8249   704   0895322f9 crypto/sha512_generic_old.o

This is because SHA-512 is fundamentally 64-bit algorithm multiplied by
excessive unrolling.

Surprisingly, doing variable renaming by hand like in spec:
t1 = ...
t2 = ...
h = g;
g = f;
f = e;
e = d + T1;
d = c;
c = b;
b = a;
a = t1 + t2;

bring stack space on i386 under control too.
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-16 Thread David Laight
Doesn't this badly overflow W[] ..
 
 +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
 + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];  \
...
 + for (i = 0; i  16; i += 8) {
...
 + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
 + }

David


--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-16 Thread Alexey Dobriyan
On 1/16/12, David Laight david.lai...@aculab.com wrote:
 Doesn't this badly overflow W[] ..

 +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
 +t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];  \
 ...
 +for (i = 0; i  16; i += 8) {
 ...
 +SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
 +}

No, why should it?
i can be only 0 and 8.
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-16 Thread Eric Dumazet
Le lundi 16 janvier 2012 à 09:56 +, David Laight a écrit :
 Doesn't this badly overflow W[] ..
  
  +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
  +   t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];  \
 ...
  +   for (i = 0; i  16; i += 8) {
 ...
  +   SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
  +   }
 
   David
 
 

No overflow since loop is done for only i=0 and i=8

By the way, I suspect previous code was chosen years ago because this
version uses less stack but adds much more code bloat.


size crypto/sha512_generic.o crypto/sha512_generic_old.o
   textdata bss dec hex filename
  17369 704   0   180734699 crypto/sha512_generic.o
   8249 704   0895322f9 crypto/sha512_generic_old.o



--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-14 Thread Linus Torvalds
On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan adobri...@gmail.com wrote:

 Line by line explanation:
 * BLEND_OP
  array is circular now, all indexes have to be modulo 16.
  Round number is positive, so remainder operation should be
  without surprises.

Don't use % except on unsigned values. Even if it's positive, if
it's a signed number and the compiler doesn't *see* that it is
absolutely positive, division is nontrivial. Even when you divide by a
constant.

For example, % 16 on an 'int' on x86-64 will generate

movl%edi, %edx
sarl$31, %edx
shrl$28, %edx
leal(%rdi,%rdx), %eax
andl$15, %eax
subl%edx, %eax

in order to get the signed case right. The fact that the end result is
correct for unsigned numbers is irrelevant: it's still stupid and
slow.

With an unsigned int, '% 16' will generate the obvious

andl$15, %eax

instead.

Quite frankly, stop using division in the first place. Dividing by
powers-of-two and expecting the compiler to fix things up is just
stupid, *exactly* because of issues like these: you either have to
think about it carefully, or the compiler may end up creating crap
code.

So just use  15 instead. That doesn't have these kinds of issues.
It is a *good* thing when the C code is close to the end result you
want to generate. It is *not* a good thing to write code that looks
nothing like the end result and just expect the compiler to do the
right thing. Even if the compiler does do the right thing, what was
the advantage?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-14 Thread Alexey Dobriyan
On Sat, Jan 14, 2012 at 11:08:45AM -0800, Linus Torvalds wrote:
 On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan adobri...@gmail.com wrote:
 
  Line by line explanation:
  * BLEND_OP
   array is circular now, all indexes have to be modulo 16.
   Round number is positive, so remainder operation should be
   without surprises.
 
 Don't use % except on unsigned values. Even if it's positive, if
 it's a signed number and the compiler doesn't *see* that it is
 absolutely positive, division is nontrivial. Even when you divide by a
 constant.
 
 For example, % 16 on an 'int' on x86-64 will generate
 
   movl%edi, %edx
   sarl$31, %edx
   shrl$28, %edx
   leal(%rdi,%rdx), %eax
   andl$15, %eax
   subl%edx, %eax
 
 in order to get the signed case right. The fact that the end result is
 correct for unsigned numbers is irrelevant: it's still stupid and
 slow.
 
 With an unsigned int, '% 16' will generate the obvious
 
   andl$15, %eax
 
 instead.
 
 Quite frankly, stop using division in the first place. Dividing by
 powers-of-two and expecting the compiler to fix things up is just
 stupid, *exactly* because of issues like these: you either have to
 think about it carefully, or the compiler may end up creating crap
 code.

For the record, it generates andl $15 here.

 So just use  15 instead. That doesn't have these kinds of issues.
 It is a *good* thing when the C code is close to the end result you
 want to generate. It is *not* a good thing to write code that looks
 nothing like the end result and just expect the compiler to do the
 right thing. Even if the compiler does do the right thing, what was
 the advantage?

Here is updated patch which explicitly uses  (equally tested):
---

For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15]
and W[i - 16]. Consequently, keeping all W[80] array on stack is
unnecessary, only 16 values are really needed.

Using W[16] instead of W[80] greatly reduces stack usage
(~750 bytes to ~340 bytes on x86_64).

Line by line explanation:
* BLEND_OP
  array is circular now, all indexes have to be modulo 16.

* initial full message scheduling is trimmed to first 16 values which
  come from data block, the rest is calculated right before it's needed.

* original loop body is unrolled version of new SHA512_0_15 and
  SHA512_16_79 macros, unrolling was done to not do explicit variable
  renaming. Otherwise it's the very same code after preprocessing.
  See sha1_transform() code which does the same trick.

Patch survives in-tree crypto test and original bugreport test
(ping flood with hmac(sha512).

See FIPS 180-2 for SHA-512 definition
http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf

Signed-off-by: Alexey Dobriyan adobri...@gmail.com
Cc: sta...@vger.kernel.org
---

 This is patch is for stable if 750 byte stack usage is not
 considered safe.

 crypto/sha512_generic.c |   58 
 1 file changed, 34 insertions(+), 24 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)
 
 static inline void BLEND_OP(int I, u64 *W)
 {
-   W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+   W[I  15] += s1(W[(I-2)  15]) + W[(I-7)  15] + s0(W[(I-15)  15]);
 }
 
 static void
@@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;
 
int i;
-   u64 W[80];
+   u64 W[16];
 
/* load the input */
 for (i = 0; i  16; i++)
 LOAD_OP(i, W, input);
 
-for (i = 16; i  80; i++) {
-BLEND_OP(i, W);
-}
-
/* load the state into our registers */
a=state[0];   b=state[1];   c=state[2];   d=state[3];
e=state[4];   f=state[5];   g=state[6];   h=state[7];
 
-   /* now iterate */
-   for (i=0; i80; i+=8) {
-   t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i  ] + W[i  ];
-   t2 = e0(a) + Maj(a,b,c);d+=t1;h=t1+t2;
-   t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
-   t2 = e0(h) + Maj(h,a,b);c+=t1;g=t1+t2;
-   t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
-   t2 = e0(g) + Maj(g,h,a);b+=t1;f=t1+t2;
-   t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
-   t2 = e0(f) + Maj(f,g,h);a+=t1;e=t1+t2;
-   t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
-   t2 = e0(e) + Maj(e,f,g);h+=t1;d=t1+t2;
-   t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
-   t2 = e0(d) + Maj(d,e,f);g+=t1;c=t1+t2;
-   t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
-   t2 = e0(c) + Maj(c,d,e);f+=t1;b=t1+t2;
-   t1 = a + e1(f) + Ch(f,g,h) + 

Re: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-14 Thread Linus Torvalds
On Sat, Jan 14, 2012 at 12:41 PM, Alexey Dobriyan adobri...@gmail.com wrote:

 For the record, it generates andl $15 here.

Ok. That means that gcc was able to prove that it never had any signed
values (which is certainly reasonable when you do things like for
(i=0; iX;i++)). But it's better to simply not rely on gcc always
getting details like this right.

It's also better to use a model that simply doesn't even require you
as a programmer to have to even *think* about signed values.

It's easy to get % wrong by mistake (signed integer modulus didn't
even use to have well-defined semantics in traditional C), and there
is almost never any excuse for using it for powers-of-two.

 Here is updated patch which explicitly uses  (equally tested):

Thanks,

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html