Hi,
Here is my latest results and patch. Please find the patches to
sha1.c, sha256.c and sh512.c attached and the "time" of the resulting
binaries in sha_benchs.log. For all binaries, in 64 and 32 bits modes
(.m32), I run 3 times the command "\time sha*sum zero1G" where zero1G
is a 10^9 bytes file created by the command:
dd if=/dev/zero of=zero1G count=1 bs=1 seek=$(( 1000 * 1000 * 1000 - 1 ))
The compilation of coreutils was done using the command
make CFLAGS="-O3"
for 64 bit version and
make CFLAGS="-m32 -O3"
for 32 bit version.
gcc is version 4.4.5 (Ubuntu 10.10)
My CPU is a Sandy Bridge @2.5GHz.
For sha1, the result is very close to Linus' version for git.
I think it could be a good idea to include thoses patches to improve
the C versions, it is probably close to the best it can be done in
"pure" C.
To improve further, assembly with or without SSE could be done in a second pass.
What to you think of that ?
I don't have a GCC farm access yet, so I can only test on my system for now.
Best regards.
Loïc
2011/9/6 Pádraig Brady <[email protected]>:
> On 09/06/2011 02:25 PM, Loďc Le Loarer wrote:
>> Hi Pádraig,
>>
>> Thank you for your answer.
>>
>> 2011/9/6 Pádraig Brady <[email protected] <mailto:[email protected]>>
>>
>> A few general points.
>> You essentially used Linus' code (albeit by
>> very helpfully isolating the significant differences).
>> It might be easier/required to just include it in gnulib?
>> There are a few files in gnulib that are not copyright of the FSF,
>> so would Nicolas and Linus need to assign copyright?
>>
>>
>> Yes, this is what I did. I don't thing that including Linus' is easier as
>> the functions have a different prototype. Also, sha1, sha256 and sha512
>> share the same structure in gnulib, changing one without changing the other
>> would be weird. But if you thing it is required, I have not problem with
>> that.
>
> Ok, let's just use your patches to gnulib so.
> The techniques were fairly generic anyway.
>
>>
>> By the way, I have done a test on sha512 and I have improved the speed on
>> the same 1Gb zero file from 4.5 to 3.9s. Please find the patch attached. So
>> I thing that using the same technics, we could improve all sha's speed.
>>
>> For performance testing I've found gcc generates
>> much more deterministic results with a -march
>> as close to native as possible or otherwise
>> the code is very susceptible to alignment issues etc.
>> Your compiler supports -march=native.
>> Note also gcc 4.6 has much better support for your sandy bridge CPU,
>> either with -march=native or -march=corei7-avx
>>
>>
>> I tried using gcc-4.6.1 (I recompiled it under my ubuntu 10.10) but I
>> couldn't see any differences. For me, using any combination of -march=native
>> or not and gcc 4.4.5 or 4.6.1 doesn't make a difference, all the times are
>> in the measurement margin.
>
> OK that at least confirms the improvement is fairly deterministic.
>
>>
>> As for the SSE version, I would also like to see that included,
>> given the proportion of hardware supporting that these days.
>> I previously noticed a coreutils SSE2 patch here:
>> http://www.arctic.org/~dean/crypto/sha1.html
>> <http://www.arctic.org/%7Edean/crypto/sha1.html>
>> Though we'd probably need some runtime SSE detection to include that.
>>
>>
>> Ok, I could try to work on this. The real problem is to test that
>> compilation and SSE detection is done correctly on several platform. I only
>> have access to a few x86 machines, what is the usual way to test more
>> platforms ?
>
> It would probably be best to get an account on the GCC compile farm.
> http://gcc.gnu.org/wiki/CompileFarm
>
> cheers,
> Pádraig.
>
--
Loïc
--- lib/sha1.c.orig 2011-09-12 15:22:57.104202004 +0200
+++ lib/sha1.c 2011-09-12 16:32:41.714201999 +0200
@@ -29,6 +29,7 @@
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
+#include <arpa/inet.h>
#if USE_UNLOCKED_IO
# include "unlocked-io.h"
@@ -316,112 +317,138 @@
#define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n))))
-#define M(I) ( tm = x[I&0x0f] ^ x[(I-14)&0x0f] \
- ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \
- , (x[I&0x0f] = rol(tm, 1)) )
+#define X(I) ntohl(words[I])
-#define R(A,B,C,D,E,F,K,M) do { E += rol( A, 5 ) \
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ *
+ * On ARM we get the best code generation by forcing a full memory barrier
+ * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
+ * the stack frame size simply explode and performance goes down the drain.
+ */
+
+#if defined(__i386__) || defined(__x86_64__)
+ #define setX(I, val) (*(volatile unsigned int *)&x[(I)&0x0f] = (val))
+#elif defined(__GNUC__) && defined(__arm__)
+ #define setX(I, val) do { x[(I)&0x0f] = (val); __asm__("":::"memory"); } while (0)
+#else
+ #define setX(I, val) (x[(I)&0x0f] = (val))
+#endif
+
+#define M(I) rol( x[I&0x0f] ^ x[(I-14)&0x0f] \
+ ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f], 1) \
+
+#define R(A,B,C,D,E,F,K,M,I) do { unsigned int TEMP = M(I); \
+ setX(I,TEMP);\
+ E += rol( A, 5 ) \
+ F( B, C, D ) \
+ K \
- + M; \
+ + TEMP; \
B = rol( B, 30 ); \
} while(0)
while (words < endp)
{
- uint32_t tm;
- int t;
- for (t = 0; t < 16; t++)
- {
- x[t] = SWAP (*words);
- words++;
- }
-
- R( a, b, c, d, e, F1, K1, x[ 0] );
- R( e, a, b, c, d, F1, K1, x[ 1] );
- R( d, e, a, b, c, F1, K1, x[ 2] );
- R( c, d, e, a, b, F1, K1, x[ 3] );
- R( b, c, d, e, a, F1, K1, x[ 4] );
- R( a, b, c, d, e, F1, K1, x[ 5] );
- R( e, a, b, c, d, F1, K1, x[ 6] );
- R( d, e, a, b, c, F1, K1, x[ 7] );
- R( c, d, e, a, b, F1, K1, x[ 8] );
- R( b, c, d, e, a, F1, K1, x[ 9] );
- R( a, b, c, d, e, F1, K1, x[10] );
- R( e, a, b, c, d, F1, K1, x[11] );
- R( d, e, a, b, c, F1, K1, x[12] );
- R( c, d, e, a, b, F1, K1, x[13] );
- R( b, c, d, e, a, F1, K1, x[14] );
- R( a, b, c, d, e, F1, K1, x[15] );
- R( e, a, b, c, d, F1, K1, M(16) );
- R( d, e, a, b, c, F1, K1, M(17) );
- R( c, d, e, a, b, F1, K1, M(18) );
- R( b, c, d, e, a, F1, K1, M(19) );
- R( a, b, c, d, e, F2, K2, M(20) );
- R( e, a, b, c, d, F2, K2, M(21) );
- R( d, e, a, b, c, F2, K2, M(22) );
- R( c, d, e, a, b, F2, K2, M(23) );
- R( b, c, d, e, a, F2, K2, M(24) );
- R( a, b, c, d, e, F2, K2, M(25) );
- R( e, a, b, c, d, F2, K2, M(26) );
- R( d, e, a, b, c, F2, K2, M(27) );
- R( c, d, e, a, b, F2, K2, M(28) );
- R( b, c, d, e, a, F2, K2, M(29) );
- R( a, b, c, d, e, F2, K2, M(30) );
- R( e, a, b, c, d, F2, K2, M(31) );
- R( d, e, a, b, c, F2, K2, M(32) );
- R( c, d, e, a, b, F2, K2, M(33) );
- R( b, c, d, e, a, F2, K2, M(34) );
- R( a, b, c, d, e, F2, K2, M(35) );
- R( e, a, b, c, d, F2, K2, M(36) );
- R( d, e, a, b, c, F2, K2, M(37) );
- R( c, d, e, a, b, F2, K2, M(38) );
- R( b, c, d, e, a, F2, K2, M(39) );
- R( a, b, c, d, e, F3, K3, M(40) );
- R( e, a, b, c, d, F3, K3, M(41) );
- R( d, e, a, b, c, F3, K3, M(42) );
- R( c, d, e, a, b, F3, K3, M(43) );
- R( b, c, d, e, a, F3, K3, M(44) );
- R( a, b, c, d, e, F3, K3, M(45) );
- R( e, a, b, c, d, F3, K3, M(46) );
- R( d, e, a, b, c, F3, K3, M(47) );
- R( c, d, e, a, b, F3, K3, M(48) );
- R( b, c, d, e, a, F3, K3, M(49) );
- R( a, b, c, d, e, F3, K3, M(50) );
- R( e, a, b, c, d, F3, K3, M(51) );
- R( d, e, a, b, c, F3, K3, M(52) );
- R( c, d, e, a, b, F3, K3, M(53) );
- R( b, c, d, e, a, F3, K3, M(54) );
- R( a, b, c, d, e, F3, K3, M(55) );
- R( e, a, b, c, d, F3, K3, M(56) );
- R( d, e, a, b, c, F3, K3, M(57) );
- R( c, d, e, a, b, F3, K3, M(58) );
- R( b, c, d, e, a, F3, K3, M(59) );
- R( a, b, c, d, e, F4, K4, M(60) );
- R( e, a, b, c, d, F4, K4, M(61) );
- R( d, e, a, b, c, F4, K4, M(62) );
- R( c, d, e, a, b, F4, K4, M(63) );
- R( b, c, d, e, a, F4, K4, M(64) );
- R( a, b, c, d, e, F4, K4, M(65) );
- R( e, a, b, c, d, F4, K4, M(66) );
- R( d, e, a, b, c, F4, K4, M(67) );
- R( c, d, e, a, b, F4, K4, M(68) );
- R( b, c, d, e, a, F4, K4, M(69) );
- R( a, b, c, d, e, F4, K4, M(70) );
- R( e, a, b, c, d, F4, K4, M(71) );
- R( d, e, a, b, c, F4, K4, M(72) );
- R( c, d, e, a, b, F4, K4, M(73) );
- R( b, c, d, e, a, F4, K4, M(74) );
- R( a, b, c, d, e, F4, K4, M(75) );
- R( e, a, b, c, d, F4, K4, M(76) );
- R( d, e, a, b, c, F4, K4, M(77) );
- R( c, d, e, a, b, F4, K4, M(78) );
- R( b, c, d, e, a, F4, K4, M(79) );
+ R( a, b, c, d, e, F1, K1, X, 0);
+ R( e, a, b, c, d, F1, K1, X, 1);
+ R( d, e, a, b, c, F1, K1, X, 2);
+ R( c, d, e, a, b, F1, K1, X, 3);
+ R( b, c, d, e, a, F1, K1, X, 4);
+ R( a, b, c, d, e, F1, K1, X, 5);
+ R( e, a, b, c, d, F1, K1, X, 6);
+ R( d, e, a, b, c, F1, K1, X, 7);
+ R( c, d, e, a, b, F1, K1, X, 8);
+ R( b, c, d, e, a, F1, K1, X, 9);
+ R( a, b, c, d, e, F1, K1, X, 10);
+ R( e, a, b, c, d, F1, K1, X, 11);
+ R( d, e, a, b, c, F1, K1, X, 12);
+ R( c, d, e, a, b, F1, K1, X, 13);
+ R( b, c, d, e, a, F1, K1, X, 14);
+ R( a, b, c, d, e, F1, K1, X, 15);
+ R( e, a, b, c, d, F1, K1, M, 16);
+ R( d, e, a, b, c, F1, K1, M, 17);
+ R( c, d, e, a, b, F1, K1, M, 18);
+ R( b, c, d, e, a, F1, K1, M, 19);
+ R( a, b, c, d, e, F2, K2, M, 20);
+ R( e, a, b, c, d, F2, K2, M, 21);
+ R( d, e, a, b, c, F2, K2, M, 22);
+ R( c, d, e, a, b, F2, K2, M, 23);
+ R( b, c, d, e, a, F2, K2, M, 24);
+ R( a, b, c, d, e, F2, K2, M, 25);
+ R( e, a, b, c, d, F2, K2, M, 26);
+ R( d, e, a, b, c, F2, K2, M, 27);
+ R( c, d, e, a, b, F2, K2, M, 28);
+ R( b, c, d, e, a, F2, K2, M, 29);
+ R( a, b, c, d, e, F2, K2, M, 30);
+ R( e, a, b, c, d, F2, K2, M, 31);
+ R( d, e, a, b, c, F2, K2, M, 32);
+ R( c, d, e, a, b, F2, K2, M, 33);
+ R( b, c, d, e, a, F2, K2, M, 34);
+ R( a, b, c, d, e, F2, K2, M, 35);
+ R( e, a, b, c, d, F2, K2, M, 36);
+ R( d, e, a, b, c, F2, K2, M, 37);
+ R( c, d, e, a, b, F2, K2, M, 38);
+ R( b, c, d, e, a, F2, K2, M, 39);
+ R( a, b, c, d, e, F3, K3, M, 40);
+ R( e, a, b, c, d, F3, K3, M, 41);
+ R( d, e, a, b, c, F3, K3, M, 42);
+ R( c, d, e, a, b, F3, K3, M, 43);
+ R( b, c, d, e, a, F3, K3, M, 44);
+ R( a, b, c, d, e, F3, K3, M, 45);
+ R( e, a, b, c, d, F3, K3, M, 46);
+ R( d, e, a, b, c, F3, K3, M, 47);
+ R( c, d, e, a, b, F3, K3, M, 48);
+ R( b, c, d, e, a, F3, K3, M, 49);
+ R( a, b, c, d, e, F3, K3, M, 50);
+ R( e, a, b, c, d, F3, K3, M, 51);
+ R( d, e, a, b, c, F3, K3, M, 52);
+ R( c, d, e, a, b, F3, K3, M, 53);
+ R( b, c, d, e, a, F3, K3, M, 54);
+ R( a, b, c, d, e, F3, K3, M, 55);
+ R( e, a, b, c, d, F3, K3, M, 56);
+ R( d, e, a, b, c, F3, K3, M, 57);
+ R( c, d, e, a, b, F3, K3, M, 58);
+ R( b, c, d, e, a, F3, K3, M, 59);
+ R( a, b, c, d, e, F4, K4, M, 60);
+ R( e, a, b, c, d, F4, K4, M, 61);
+ R( d, e, a, b, c, F4, K4, M, 62);
+ R( c, d, e, a, b, F4, K4, M, 63);
+ R( b, c, d, e, a, F4, K4, M, 64);
+ R( a, b, c, d, e, F4, K4, M, 65);
+ R( e, a, b, c, d, F4, K4, M, 66);
+ R( d, e, a, b, c, F4, K4, M, 67);
+ R( c, d, e, a, b, F4, K4, M, 68);
+ R( b, c, d, e, a, F4, K4, M, 69);
+ R( a, b, c, d, e, F4, K4, M, 70);
+ R( e, a, b, c, d, F4, K4, M, 71);
+ R( d, e, a, b, c, F4, K4, M, 72);
+ R( c, d, e, a, b, F4, K4, M, 73);
+ R( b, c, d, e, a, F4, K4, M, 74);
+ R( a, b, c, d, e, F4, K4, M, 75);
+ R( e, a, b, c, d, F4, K4, M, 76);
+ R( d, e, a, b, c, F4, K4, M, 77);
+ R( c, d, e, a, b, F4, K4, M, 78);
+ R( b, c, d, e, a, F4, K4, M, 79);
a = ctx->A += a;
b = ctx->B += b;
c = ctx->C += c;
d = ctx->D += d;
e = ctx->E += e;
+ words += 16;
}
}
--- lib/sha256.c.orig 2011-09-09 23:08:04.521357998 +0200
+++ lib/sha256.c 2011-09-12 16:32:42.184202007 +0200
@@ -27,6 +27,7 @@
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
+#include <arpa/inet.h>
#if USE_UNLOCKED_IO
# include "unlocked-io.h"
@@ -468,94 +469,88 @@
#define SS0(x) (rol(x,30)^rol(x,19)^rol(x,10))
#define SS1(x) (rol(x,26)^rol(x,21)^rol(x,7))
-#define M(I) ( tm = S1(x[(I-2)&0x0f]) + x[(I-7)&0x0f] \
- + S0(x[(I-15)&0x0f]) + x[I&0x0f] \
- , x[I&0x0f] = tm )
-
-#define R(A,B,C,D,E,F,G,H,K,M) do { t0 = SS0(A) + F2(A,B,C); \
- t1 = H + SS1(E) \
- + F1(E,F,G) \
- + K \
- + M; \
- D += t1; H = t0 + t1; \
- } while(0)
-
+#define X(I) (ntohl(words[I]))
+#define M(I) ( S1(x[(I-2)&0x0f]) + x[(I-7)&0x0f] \
+ + S0(x[(I-15)&0x0f]) + x[I&0x0f] \
+ )
+
+#define R(A,B,C,D,E,F,G,H,M,I) do { \
+ uint32_t t1; \
+ uint32_t tm = M(I); \
+ x[I&0x0f] = tm; \
+ t1 = H + SS1(E) \
+ + F1(E,F,G) \
+ + K(I) \
+ + tm; \
+ D += t1; \
+ H = SS0(A) + F2(A,B,C) + t1; \
+ } while(0)
while (words < endp)
{
- uint32_t tm;
- uint32_t t0, t1;
- int t;
- /* FIXME: see sha1.c for a better implementation. */
- for (t = 0; t < 16; t++)
- {
- x[t] = SWAP (*words);
- words++;
- }
-
- R( a, b, c, d, e, f, g, h, K( 0), x[ 0] );
- R( h, a, b, c, d, e, f, g, K( 1), x[ 1] );
- R( g, h, a, b, c, d, e, f, K( 2), x[ 2] );
- R( f, g, h, a, b, c, d, e, K( 3), x[ 3] );
- R( e, f, g, h, a, b, c, d, K( 4), x[ 4] );
- R( d, e, f, g, h, a, b, c, K( 5), x[ 5] );
- R( c, d, e, f, g, h, a, b, K( 6), x[ 6] );
- R( b, c, d, e, f, g, h, a, K( 7), x[ 7] );
- R( a, b, c, d, e, f, g, h, K( 8), x[ 8] );
- R( h, a, b, c, d, e, f, g, K( 9), x[ 9] );
- R( g, h, a, b, c, d, e, f, K(10), x[10] );
- R( f, g, h, a, b, c, d, e, K(11), x[11] );
- R( e, f, g, h, a, b, c, d, K(12), x[12] );
- R( d, e, f, g, h, a, b, c, K(13), x[13] );
- R( c, d, e, f, g, h, a, b, K(14), x[14] );
- R( b, c, d, e, f, g, h, a, K(15), x[15] );
- R( a, b, c, d, e, f, g, h, K(16), M(16) );
- R( h, a, b, c, d, e, f, g, K(17), M(17) );
- R( g, h, a, b, c, d, e, f, K(18), M(18) );
- R( f, g, h, a, b, c, d, e, K(19), M(19) );
- R( e, f, g, h, a, b, c, d, K(20), M(20) );
- R( d, e, f, g, h, a, b, c, K(21), M(21) );
- R( c, d, e, f, g, h, a, b, K(22), M(22) );
- R( b, c, d, e, f, g, h, a, K(23), M(23) );
- R( a, b, c, d, e, f, g, h, K(24), M(24) );
- R( h, a, b, c, d, e, f, g, K(25), M(25) );
- R( g, h, a, b, c, d, e, f, K(26), M(26) );
- R( f, g, h, a, b, c, d, e, K(27), M(27) );
- R( e, f, g, h, a, b, c, d, K(28), M(28) );
- R( d, e, f, g, h, a, b, c, K(29), M(29) );
- R( c, d, e, f, g, h, a, b, K(30), M(30) );
- R( b, c, d, e, f, g, h, a, K(31), M(31) );
- R( a, b, c, d, e, f, g, h, K(32), M(32) );
- R( h, a, b, c, d, e, f, g, K(33), M(33) );
- R( g, h, a, b, c, d, e, f, K(34), M(34) );
- R( f, g, h, a, b, c, d, e, K(35), M(35) );
- R( e, f, g, h, a, b, c, d, K(36), M(36) );
- R( d, e, f, g, h, a, b, c, K(37), M(37) );
- R( c, d, e, f, g, h, a, b, K(38), M(38) );
- R( b, c, d, e, f, g, h, a, K(39), M(39) );
- R( a, b, c, d, e, f, g, h, K(40), M(40) );
- R( h, a, b, c, d, e, f, g, K(41), M(41) );
- R( g, h, a, b, c, d, e, f, K(42), M(42) );
- R( f, g, h, a, b, c, d, e, K(43), M(43) );
- R( e, f, g, h, a, b, c, d, K(44), M(44) );
- R( d, e, f, g, h, a, b, c, K(45), M(45) );
- R( c, d, e, f, g, h, a, b, K(46), M(46) );
- R( b, c, d, e, f, g, h, a, K(47), M(47) );
- R( a, b, c, d, e, f, g, h, K(48), M(48) );
- R( h, a, b, c, d, e, f, g, K(49), M(49) );
- R( g, h, a, b, c, d, e, f, K(50), M(50) );
- R( f, g, h, a, b, c, d, e, K(51), M(51) );
- R( e, f, g, h, a, b, c, d, K(52), M(52) );
- R( d, e, f, g, h, a, b, c, K(53), M(53) );
- R( c, d, e, f, g, h, a, b, K(54), M(54) );
- R( b, c, d, e, f, g, h, a, K(55), M(55) );
- R( a, b, c, d, e, f, g, h, K(56), M(56) );
- R( h, a, b, c, d, e, f, g, K(57), M(57) );
- R( g, h, a, b, c, d, e, f, K(58), M(58) );
- R( f, g, h, a, b, c, d, e, K(59), M(59) );
- R( e, f, g, h, a, b, c, d, K(60), M(60) );
- R( d, e, f, g, h, a, b, c, K(61), M(61) );
- R( c, d, e, f, g, h, a, b, K(62), M(62) );
- R( b, c, d, e, f, g, h, a, K(63), M(63) );
+ R( a, b, c, d, e, f, g, h, X, 0 );
+ R( h, a, b, c, d, e, f, g, X, 1 );
+ R( g, h, a, b, c, d, e, f, X, 2 );
+ R( f, g, h, a, b, c, d, e, X, 3 );
+ R( e, f, g, h, a, b, c, d, X, 4 );
+ R( d, e, f, g, h, a, b, c, X, 5 );
+ R( c, d, e, f, g, h, a, b, X, 6 );
+ R( b, c, d, e, f, g, h, a, X, 7 );
+ R( a, b, c, d, e, f, g, h, X, 8 );
+ R( h, a, b, c, d, e, f, g, X, 9 );
+ R( g, h, a, b, c, d, e, f, X, 10 );
+ R( f, g, h, a, b, c, d, e, X, 11 );
+ R( e, f, g, h, a, b, c, d, X, 12 );
+ R( d, e, f, g, h, a, b, c, X, 13 );
+ R( c, d, e, f, g, h, a, b, X, 14 );
+ R( b, c, d, e, f, g, h, a, X, 15 );
+ R( a, b, c, d, e, f, g, h, M, 16 );
+ R( h, a, b, c, d, e, f, g, M, 17 );
+ R( g, h, a, b, c, d, e, f, M, 18 );
+ R( f, g, h, a, b, c, d, e, M, 19 );
+ R( e, f, g, h, a, b, c, d, M, 20 );
+ R( d, e, f, g, h, a, b, c, M, 21 );
+ R( c, d, e, f, g, h, a, b, M, 22 );
+ R( b, c, d, e, f, g, h, a, M, 23 );
+ R( a, b, c, d, e, f, g, h, M, 24 );
+ R( h, a, b, c, d, e, f, g, M, 25 );
+ R( g, h, a, b, c, d, e, f, M, 26 );
+ R( f, g, h, a, b, c, d, e, M, 27 );
+ R( e, f, g, h, a, b, c, d, M, 28 );
+ R( d, e, f, g, h, a, b, c, M, 29 );
+ R( c, d, e, f, g, h, a, b, M, 30 );
+ R( b, c, d, e, f, g, h, a, M, 31 );
+ R( a, b, c, d, e, f, g, h, M, 32 );
+ R( h, a, b, c, d, e, f, g, M, 33 );
+ R( g, h, a, b, c, d, e, f, M, 34 );
+ R( f, g, h, a, b, c, d, e, M, 35 );
+ R( e, f, g, h, a, b, c, d, M, 36 );
+ R( d, e, f, g, h, a, b, c, M, 37 );
+ R( c, d, e, f, g, h, a, b, M, 38 );
+ R( b, c, d, e, f, g, h, a, M, 39 );
+ R( a, b, c, d, e, f, g, h, M, 40 );
+ R( h, a, b, c, d, e, f, g, M, 41 );
+ R( g, h, a, b, c, d, e, f, M, 42 );
+ R( f, g, h, a, b, c, d, e, M, 43 );
+ R( e, f, g, h, a, b, c, d, M, 44 );
+ R( d, e, f, g, h, a, b, c, M, 45 );
+ R( c, d, e, f, g, h, a, b, M, 46 );
+ R( b, c, d, e, f, g, h, a, M, 47 );
+ R( a, b, c, d, e, f, g, h, M, 48 );
+ R( h, a, b, c, d, e, f, g, M, 49 );
+ R( g, h, a, b, c, d, e, f, M, 50 );
+ R( f, g, h, a, b, c, d, e, M, 51 );
+ R( e, f, g, h, a, b, c, d, M, 52 );
+ R( d, e, f, g, h, a, b, c, M, 53 );
+ R( c, d, e, f, g, h, a, b, M, 54 );
+ R( b, c, d, e, f, g, h, a, M, 55 );
+ R( a, b, c, d, e, f, g, h, M, 56 );
+ R( h, a, b, c, d, e, f, g, M, 57 );
+ R( g, h, a, b, c, d, e, f, M, 58 );
+ R( f, g, h, a, b, c, d, e, M, 59 );
+ R( e, f, g, h, a, b, c, d, M, 60 );
+ R( d, e, f, g, h, a, b, c, M, 61 );
+ R( c, d, e, f, g, h, a, b, M, 62 );
+ R( b, c, d, e, f, g, h, a, M, 63 );
a = ctx->state[0] += a;
b = ctx->state[1] += b;
@@ -565,5 +560,6 @@
f = ctx->state[5] += f;
g = ctx->state[6] += g;
h = ctx->state[7] += h;
+ words += 16;
}
}
--- lib/sha512.c.orig 2011-09-06 15:24:17.320209997 +0200
+++ lib/sha512.c 2011-09-12 16:32:42.604202003 +0200
@@ -27,6 +27,7 @@
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
+#include <endian.h>
#if USE_UNLOCKED_IO
# include "unlocked-io.h"
@@ -383,6 +384,7 @@
#if !_STRING_ARCH_unaligned
# define alignof(type) offsetof (struct { char c; type x; }, x)
# define UNALIGNED_P(p) (((size_t) p) % alignof (u64) != 0)
+#error "unaligned"
if (UNALIGNED_P (buffer))
while (len > 128)
{
@@ -498,19 +500,29 @@
#define SS0(x) u64xor (u64rol (x, 36), u64xor (u64rol (x, 30), u64rol (x, 25)))
#define SS1(x) u64xor (u64rol(x, 50), u64xor (u64rol (x, 46), u64rol (x, 23)))
-#define M(I) (x[(I) & 15] \
- = u64plus (x[(I) & 15], \
- u64plus (S1 (x[((I) - 2) & 15]), \
- u64plus (x[((I) - 7) & 15], \
- S0 (x[((I) - 15) & 15])))))
+#if __WORDSIZE == 64
+#define setX(I, val) (*(volatile u64 *)&x[(I)&15] = (val))
+// be64toh in define in endian.h
+#define X(I) (be64toh(words[I]))
+#else
+#define setX(I, val) (x[(I)&15] = (val))
+#define X(I) (SWAP(words[I]))
+#endif
-#define R(A, B, C, D, E, F, G, H, K, M) \
+#define M(I) (u64plus (x[(I) & 15], \
+ u64plus (S1 (x[((I) - 2) & 15]), \
+ u64plus (x[((I) - 7) & 15], \
+ S0 (x[((I) - 15) & 15])))))
+#define R(A, B, C, D, E, F, G, H, M, I) \
do \
{ \
+ u64 temp = M(I); \
u64 t0 = u64plus (SS0 (A), F2 (A, B, C)); \
u64 t1 = \
u64plus (H, u64plus (SS1 (E), \
- u64plus (F1 (E, F, G), u64plus (K, M)))); \
+ u64plus (F1 (E, F, G), \
+ u64plus (K(I), temp)))); \
+ setX(I, temp); \
D = u64plus (D, t1); \
H = u64plus (t0, t1); \
} \
@@ -518,94 +530,86 @@
while (words < endp)
{
- int t;
- /* FIXME: see sha1.c for a better implementation. */
- for (t = 0; t < 16; t++)
- {
- x[t] = SWAP (*words);
- words++;
- }
-
- R( a, b, c, d, e, f, g, h, K( 0), x[ 0] );
- R( h, a, b, c, d, e, f, g, K( 1), x[ 1] );
- R( g, h, a, b, c, d, e, f, K( 2), x[ 2] );
- R( f, g, h, a, b, c, d, e, K( 3), x[ 3] );
- R( e, f, g, h, a, b, c, d, K( 4), x[ 4] );
- R( d, e, f, g, h, a, b, c, K( 5), x[ 5] );
- R( c, d, e, f, g, h, a, b, K( 6), x[ 6] );
- R( b, c, d, e, f, g, h, a, K( 7), x[ 7] );
- R( a, b, c, d, e, f, g, h, K( 8), x[ 8] );
- R( h, a, b, c, d, e, f, g, K( 9), x[ 9] );
- R( g, h, a, b, c, d, e, f, K(10), x[10] );
- R( f, g, h, a, b, c, d, e, K(11), x[11] );
- R( e, f, g, h, a, b, c, d, K(12), x[12] );
- R( d, e, f, g, h, a, b, c, K(13), x[13] );
- R( c, d, e, f, g, h, a, b, K(14), x[14] );
- R( b, c, d, e, f, g, h, a, K(15), x[15] );
- R( a, b, c, d, e, f, g, h, K(16), M(16) );
- R( h, a, b, c, d, e, f, g, K(17), M(17) );
- R( g, h, a, b, c, d, e, f, K(18), M(18) );
- R( f, g, h, a, b, c, d, e, K(19), M(19) );
- R( e, f, g, h, a, b, c, d, K(20), M(20) );
- R( d, e, f, g, h, a, b, c, K(21), M(21) );
- R( c, d, e, f, g, h, a, b, K(22), M(22) );
- R( b, c, d, e, f, g, h, a, K(23), M(23) );
- R( a, b, c, d, e, f, g, h, K(24), M(24) );
- R( h, a, b, c, d, e, f, g, K(25), M(25) );
- R( g, h, a, b, c, d, e, f, K(26), M(26) );
- R( f, g, h, a, b, c, d, e, K(27), M(27) );
- R( e, f, g, h, a, b, c, d, K(28), M(28) );
- R( d, e, f, g, h, a, b, c, K(29), M(29) );
- R( c, d, e, f, g, h, a, b, K(30), M(30) );
- R( b, c, d, e, f, g, h, a, K(31), M(31) );
- R( a, b, c, d, e, f, g, h, K(32), M(32) );
- R( h, a, b, c, d, e, f, g, K(33), M(33) );
- R( g, h, a, b, c, d, e, f, K(34), M(34) );
- R( f, g, h, a, b, c, d, e, K(35), M(35) );
- R( e, f, g, h, a, b, c, d, K(36), M(36) );
- R( d, e, f, g, h, a, b, c, K(37), M(37) );
- R( c, d, e, f, g, h, a, b, K(38), M(38) );
- R( b, c, d, e, f, g, h, a, K(39), M(39) );
- R( a, b, c, d, e, f, g, h, K(40), M(40) );
- R( h, a, b, c, d, e, f, g, K(41), M(41) );
- R( g, h, a, b, c, d, e, f, K(42), M(42) );
- R( f, g, h, a, b, c, d, e, K(43), M(43) );
- R( e, f, g, h, a, b, c, d, K(44), M(44) );
- R( d, e, f, g, h, a, b, c, K(45), M(45) );
- R( c, d, e, f, g, h, a, b, K(46), M(46) );
- R( b, c, d, e, f, g, h, a, K(47), M(47) );
- R( a, b, c, d, e, f, g, h, K(48), M(48) );
- R( h, a, b, c, d, e, f, g, K(49), M(49) );
- R( g, h, a, b, c, d, e, f, K(50), M(50) );
- R( f, g, h, a, b, c, d, e, K(51), M(51) );
- R( e, f, g, h, a, b, c, d, K(52), M(52) );
- R( d, e, f, g, h, a, b, c, K(53), M(53) );
- R( c, d, e, f, g, h, a, b, K(54), M(54) );
- R( b, c, d, e, f, g, h, a, K(55), M(55) );
- R( a, b, c, d, e, f, g, h, K(56), M(56) );
- R( h, a, b, c, d, e, f, g, K(57), M(57) );
- R( g, h, a, b, c, d, e, f, K(58), M(58) );
- R( f, g, h, a, b, c, d, e, K(59), M(59) );
- R( e, f, g, h, a, b, c, d, K(60), M(60) );
- R( d, e, f, g, h, a, b, c, K(61), M(61) );
- R( c, d, e, f, g, h, a, b, K(62), M(62) );
- R( b, c, d, e, f, g, h, a, K(63), M(63) );
- R( a, b, c, d, e, f, g, h, K(64), M(64) );
- R( h, a, b, c, d, e, f, g, K(65), M(65) );
- R( g, h, a, b, c, d, e, f, K(66), M(66) );
- R( f, g, h, a, b, c, d, e, K(67), M(67) );
- R( e, f, g, h, a, b, c, d, K(68), M(68) );
- R( d, e, f, g, h, a, b, c, K(69), M(69) );
- R( c, d, e, f, g, h, a, b, K(70), M(70) );
- R( b, c, d, e, f, g, h, a, K(71), M(71) );
- R( a, b, c, d, e, f, g, h, K(72), M(72) );
- R( h, a, b, c, d, e, f, g, K(73), M(73) );
- R( g, h, a, b, c, d, e, f, K(74), M(74) );
- R( f, g, h, a, b, c, d, e, K(75), M(75) );
- R( e, f, g, h, a, b, c, d, K(76), M(76) );
- R( d, e, f, g, h, a, b, c, K(77), M(77) );
- R( c, d, e, f, g, h, a, b, K(78), M(78) );
- R( b, c, d, e, f, g, h, a, K(79), M(79) );
+ R( a, b, c, d, e, f, g, h, X, 0 );
+ R( h, a, b, c, d, e, f, g, X, 1 );
+ R( g, h, a, b, c, d, e, f, X, 2 );
+ R( f, g, h, a, b, c, d, e, X, 3 );
+ R( e, f, g, h, a, b, c, d, X, 4 );
+ R( d, e, f, g, h, a, b, c, X, 5 );
+ R( c, d, e, f, g, h, a, b, X, 6 );
+ R( b, c, d, e, f, g, h, a, X, 7 );
+ R( a, b, c, d, e, f, g, h, X, 8 );
+ R( h, a, b, c, d, e, f, g, X, 9 );
+ R( g, h, a, b, c, d, e, f, X, 10 );
+ R( f, g, h, a, b, c, d, e, X, 11 );
+ R( e, f, g, h, a, b, c, d, X, 12 );
+ R( d, e, f, g, h, a, b, c, X, 13 );
+ R( c, d, e, f, g, h, a, b, X, 14 );
+ R( b, c, d, e, f, g, h, a, X, 15 );
+ R( a, b, c, d, e, f, g, h, M, 16 );
+ R( h, a, b, c, d, e, f, g, M, 17 );
+ R( g, h, a, b, c, d, e, f, M, 18 );
+ R( f, g, h, a, b, c, d, e, M, 19 );
+ R( e, f, g, h, a, b, c, d, M, 20 );
+ R( d, e, f, g, h, a, b, c, M, 21 );
+ R( c, d, e, f, g, h, a, b, M, 22 );
+ R( b, c, d, e, f, g, h, a, M, 23 );
+ R( a, b, c, d, e, f, g, h, M, 24 );
+ R( h, a, b, c, d, e, f, g, M, 25 );
+ R( g, h, a, b, c, d, e, f, M, 26 );
+ R( f, g, h, a, b, c, d, e, M, 27 );
+ R( e, f, g, h, a, b, c, d, M, 28 );
+ R( d, e, f, g, h, a, b, c, M, 29 );
+ R( c, d, e, f, g, h, a, b, M, 30 );
+ R( b, c, d, e, f, g, h, a, M, 31 );
+ R( a, b, c, d, e, f, g, h, M, 32 );
+ R( h, a, b, c, d, e, f, g, M, 33 );
+ R( g, h, a, b, c, d, e, f, M, 34 );
+ R( f, g, h, a, b, c, d, e, M, 35 );
+ R( e, f, g, h, a, b, c, d, M, 36 );
+ R( d, e, f, g, h, a, b, c, M, 37 );
+ R( c, d, e, f, g, h, a, b, M, 38 );
+ R( b, c, d, e, f, g, h, a, M, 39 );
+ R( a, b, c, d, e, f, g, h, M, 40 );
+ R( h, a, b, c, d, e, f, g, M, 41 );
+ R( g, h, a, b, c, d, e, f, M, 42 );
+ R( f, g, h, a, b, c, d, e, M, 43 );
+ R( e, f, g, h, a, b, c, d, M, 44 );
+ R( d, e, f, g, h, a, b, c, M, 45 );
+ R( c, d, e, f, g, h, a, b, M, 46 );
+ R( b, c, d, e, f, g, h, a, M, 47 );
+ R( a, b, c, d, e, f, g, h, M, 48 );
+ R( h, a, b, c, d, e, f, g, M, 49 );
+ R( g, h, a, b, c, d, e, f, M, 50 );
+ R( f, g, h, a, b, c, d, e, M, 51 );
+ R( e, f, g, h, a, b, c, d, M, 52 );
+ R( d, e, f, g, h, a, b, c, M, 53 );
+ R( c, d, e, f, g, h, a, b, M, 54 );
+ R( b, c, d, e, f, g, h, a, M, 55 );
+ R( a, b, c, d, e, f, g, h, M, 56 );
+ R( h, a, b, c, d, e, f, g, M, 57 );
+ R( g, h, a, b, c, d, e, f, M, 58 );
+ R( f, g, h, a, b, c, d, e, M, 59 );
+ R( e, f, g, h, a, b, c, d, M, 60 );
+ R( d, e, f, g, h, a, b, c, M, 61 );
+ R( c, d, e, f, g, h, a, b, M, 62 );
+ R( b, c, d, e, f, g, h, a, M, 63 );
+ R( a, b, c, d, e, f, g, h, M, 64 );
+ R( h, a, b, c, d, e, f, g, M, 65 );
+ R( g, h, a, b, c, d, e, f, M, 66 );
+ R( f, g, h, a, b, c, d, e, M, 67 );
+ R( e, f, g, h, a, b, c, d, M, 68 );
+ R( d, e, f, g, h, a, b, c, M, 69 );
+ R( c, d, e, f, g, h, a, b, M, 70 );
+ R( b, c, d, e, f, g, h, a, M, 71 );
+ R( a, b, c, d, e, f, g, h, M, 72 );
+ R( h, a, b, c, d, e, f, g, M, 73 );
+ R( g, h, a, b, c, d, e, f, M, 74 );
+ R( f, g, h, a, b, c, d, e, M, 75 );
+ R( e, f, g, h, a, b, c, d, M, 76 );
+ R( d, e, f, g, h, a, b, c, M, 77 );
+ R( c, d, e, f, g, h, a, b, M, 78 );
+ R( b, c, d, e, f, g, h, a, M, 79 );
a = ctx->state[0] = u64plus (ctx->state[0], a);
b = ctx->state[1] = u64plus (ctx->state[1], b);
@@ -615,5 +619,6 @@
f = ctx->state[5] = u64plus (ctx->state[5], f);
g = ctx->state[6] = u64plus (ctx->state[6], g);
h = ctx->state[7] = u64plus (ctx->state[7], h);
+ words += 16;
}
}
./sha1sum.orig
3.54user 0.10system 0:03.65elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
3.50user 0.14system 0:03.65elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
3.50user 0.11system 0:03.61elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
./sha1sum.v1
2.59user 0.11system 0:02.71elapsed 99%CPU (0avgtext+0avgdata 2832maxresident)k
2.59user 0.11system 0:02.70elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
2.60user 0.11system 0:02.72elapsed 99%CPU (0avgtext+0avgdata 2832maxresident)k
./sha1sum.m32.orig
5.15user 0.10system 0:05.26elapsed 99%CPU (0avgtext+0avgdata 2528maxresident)k
5.15user 0.11system 0:05.27elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
5.08user 0.16system 0:05.25elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
./sha1sum.m32.v1
2.93user 0.14system 0:03.08elapsed 99%CPU (0avgtext+0avgdata 2512maxresident)k
2.90user 0.18system 0:03.09elapsed 99%CPU (0avgtext+0avgdata 2512maxresident)k
3.00user 0.08system 0:03.09elapsed 99%CPU (0avgtext+0avgdata 2528maxresident)k
./sha256sum.orig
6.44user 0.12system 0:06.58elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.50user 0.08system 0:06.59elapsed 99%CPU (0avgtext+0avgdata 2864maxresident)k
6.48user 0.14system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 2864maxresident)k
./sha256sum.v1
6.09user 0.10system 0:06.20elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.10user 0.08system 0:06.19elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.05user 0.13system 0:06.19elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
./sha256sum.m32.orig
7.07user 0.13system 0:07.21elapsed 99%CPU (0avgtext+0avgdata 2576maxresident)k
7.09user 0.12system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2576maxresident)k
7.06user 0.13system 0:07.20elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
./sha256sum.m32.v1
6.47user 0.13system 0:06.61elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
6.53user 0.11system 0:06.65elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
6.48user 0.19system 0:06.68elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
./sha512sum.orig
4.43user 0.12system 0:04.55elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.45user 0.11system 0:04.56elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.48user 0.09system 0:04.58elapsed 99%CPU (0avgtext+0avgdata 2912maxresident)k
./sha512sum.v1
4.05user 0.11system 0:04.17elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.11user 0.06system 0:04.18elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.15user 0.14system 0:04.29elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
./sha512sum.m32.orig
25.23user 0.20system 0:25.46elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
25.22user 0.11system 0:25.36elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
25.13user 0.15system 0:25.32elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
./sha512sum.m32.v1
22.13user 0.14system 0:22.30elapsed 99%CPU (0avgtext+0avgdata 2784maxresident)k
22.09user 0.21system 0:22.33elapsed 99%CPU (0avgtext+0avgdata 2784maxresident)k
22.19user 0.13system 0:22.35elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k