Re: drivers/char/random.c: More futzing about
Just to add to my total confusion about the totally disparate performance numbers we're seeing, I did some benchmarks on other machines. The speedup isn't as good one-pass as it is iterated, and as I mentioned it's slower on a P4, but it's not 7 times slower by any stretch. There are all 1-iteration numbers, run immediately after scp-ing the binary to the machine so there's no possibility if anything being cached. (The "64" and "32" versions are compiled -m32 and -m64, of course.) 2.5 GHz Phenom 9850: $ /tmp/random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0:199142 (-57) 1:104 95 (-9) 2:104110 (+6) 3:103109 (+6) 4:105 89 (-16) 5:103 88 (-15) 6:104 89 (-15) 7:104 95 (-9) 8:105 85 (-20) 9:105 85 (-20) $ /tmp/random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0:324147 (-177) 1:100 86 (-14) 2:100 99 (-1) 3:100 88 (-12) 4:100 86 (-14) 5:100 86 (-14) 6:100 89 (-11) 7:100111 (+11) 8:100111 (+11) 9:100 88 (-12) $ /tmp/random64 1 pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1 0: 554788 220327 (-334461) 1: 554825 220176 (-334649) 2: 553505 220148 (-57) 3: 554661 220064 (-334597) 4: 569559 220064 (-349495) 5: 612798 220065 (-392733) 6: 570287 220064 (-350223) 7: 554790 220064 (-334726) 8: 554715 220065 (-334650) 9: 569840 220064 (-349776) $ /tmp/random32 1 pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1 0: 520117 280225 (-239892) 1: 520125 280154 (-239971) 2: 520104 280094 (-240010) 3: 520079 280060 (-240019) 4: 520069 280060 (-240009) 5: 520060 280060 (-24) 6: 558971 280060 (-278911) 7: 520102 280060 (-240042) 8: 520082 280060 (-240022) 9: 520058 280060 (-239998) 3 GHz i5-3330: $ /tmp/random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 78 75 (-3) 1: 36 33 (-3) 2: 33 39 (+6) 3: 36 30 (-6) 4: 36 33 (-3) 5: 30 33 (+3) 6: 30 54 (+24) 7: 24 48 (+24) 8: 27 33 (+6) 9: 30 33 (+3) $ /tmp/random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 66 78 (+12) 1: 39 39 (+0) 2: 36 39 (+3) 3: 45 33 (-12) 4: 42 33 (-9) 5: 33 42 (+9) 6: 45 33 (-12) 7: 39 36 (-3) 8:105 48 (-57) 9: 42 39 (-3) $ /tmp/random64 1 pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1 0: 406188 218104 (-188084) 1: 402620 246968 (-155652) 2: 402652 239840 (-162812) 3: 402720 200312 (-202408) 4: 402584 200080 (-202504) 5: 447488 200228 (-247260) 6: 402788 200312 (-202476) 7: 402688 200080 (-202608) 8: 427140 224320 (-202820) 9: 402576 200080 (-202496) $ /tmp/random32 1 pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1 0: 406485 266670 (-139815) 1: 392694 266463 (-126231) 2: 392496 266763 (-125733) 3: 426003 266145 (-159858) 4: 392688 27 (-126021) 5: 432231 266589 (-165642) 6: 392754 298734 (-94020) 7: 392883 284994 (-107889) 8: 392637 266694 (-125943) 9: 392985 267024 (-125961) 3.5 GHz i7-2700: # /tmp/perftest /tmp/random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 82 90 (+8) 1: 38 41 (+3) 2: 46 38 (-8) 3: 35 41 (+6) 4: 46 41 (-5) 5: 38 38 (+0) 6: 41 55 (+14) 7: 41 35 (-6) 8: 46 24 (-22) 9: 35 38 (+3) # /tmp/perftest /tmp/random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 82 76 (-6) 1: 32 53 (+21) 2: 49 44 (-5) 3: 35 41 (+6) 4: 46 35 (-11) 5: 35 44 (+9) 6: 49 50 (+1) 7: 41 41 (+0) 8: 32 44 (+12) 9: 49 44 (-5) # /tmp/perfte
Re: drivers/char/random.c: More futzing about
On Wed, Jun 11, 2014 at 08:32:49PM -0400, George Spelvin wrote: > Comparable, but slightly slower. Clearly, I need to do better. > And you can see the first-iteration effects clearly. Still, > noting *remotely* like 7x! I redid my numbers, and I can no longer reproduce the 7x slowdown. I do see that if you compile w/o -O2, fast_mix2 is twice as slow. But it's not 7x slower. When compiling w/o -O2: fast_mix fast_mix2 task-clock 221.3 ms 460.7 ms When compiling with -O2 -Os: fast_mix fast_mix2 task-clock 115.4 ms 71.5 ms And here's the numbers I got with a single iteration using rdtsc: fast_mix: 164 fast_mix2: 237 fast_mix: 168 fast_mix2: 230 fast_mix: 166 fast_mix2: 228 fast_mix: 164 fast_mix2: 230 fast_mix: 166 fast_mix2: 230 fast_mix: 168 fast_mix2: 232 fast_mix: 166 fast_mix2: 228 fast_mix: 164 fast_mix2: 228 fast_mix: 166 fast_mix2: 234 fast_mix: 166 fast_mix2: 230 - Ted #include #include #include #include typedef unsigned int __u32; struct fast_pool { __u32 pool[4]; unsigned long last; unsigned short count; unsigned char rotate; unsigned char last_timer_intr; }; /** * rol32 - rotate a 32-bit value left * @word: value to rotate * @shift: bits to roll */ static inline __u32 rol32(__u32 word, unsigned int shift) { return (word << shift) | (word >> (32 - shift)); } static __u32 const twist_table[8] = { 0x, 0x3b6e20c8, 0x76dc4190, 0x4db26158, 0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 }; /* * This is a fast mixing routine used by the interrupt randomness * collector. It's hardcoded for an 128 bit pool and assumes that any * locks that might be needed are taken by the caller. */ extern void fast_mix(struct fast_pool *f, __u32 input[4]) { __u32 w; unsignedinput_rotate = f->rotate; w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3]; f->pool[0] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 14) & 31; w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0]; f->pool[1] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1]; f->pool[2] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2]; f->pool[3] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; f->rotate = input_rotate; f->count++; } extern fast_mix2(struct fast_pool *f, __u32 const input[4]) { __u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1]; __u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3]; int i; for (i = 0; i < 3; i++) { /* * Inspired by ChaCha's QuarterRound, but * modified for much greater parallelism. * Surprisingly, rotating a and c seems to work * better than b and d. And it runs faster. */ a += b; c += d; d ^= a; b ^= c; a = rol32(a, 15); c = rol32(c, 21); a += b; c += d; d ^= a; b ^= c; a = rol32(a, 3);c = rol32(c, 7); } f->pool[0] = a; f->pool[1] = b; f->pool[2] = c; f->pool[3] = d; f->count++; } static __inline__ volatile unsigned long long rdtsc(void) { unsigned long long int x; __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x)); return x; } int main(int argc, char **argv) { struct fast_pool f; int i; __u32 input[4]; unsigned volatile long long start_time, end_time; memset(&f, 0, sizeof(f)); memset(&input, 0, sizeof(input)); f.pool[0] = 1; #if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2) for (i=0; i < 10; i++) { usleep(5); start_time = rdtsc(); fast_mix(&f, input); end_time = rdtsc(); printf("fast_mix: %llu\t", end_time - start_time); usleep(5); start_time = rdtsc(); fast_mix2(&f, input); end_time = rdtsc(); printf("fast_mix2: %llu\n", end_time - start_time); } #endif #ifdef BENCH_FASTMIX for (i=0; i < 1024; i++) { fast_mix(&f, input); } #endif #ifdef BENCH_FASTMIX2 for (i=0; i < 1024; i++) { fast_mix2(&f, input); } #endif } -- To unsubscribe from this list:
Re: drivers/char/random.c: More futzing about
> Sadly I can't find the tree, but I'm 94% sure it was Skein-256 > (specifically the SHA3-256 candidate parameter set.) It would be nice to have two hash functions, optimized separately for 32- and 64-bit processors. As the Skein report says, the algorithm can be adapted to 32 bits easily enough. I also did some work a while ago to adapt the Skein parameter search code to develop a Skein-192 (6x32 bits) that would fit into registers on x86-32. (It got stalled when I e-mailed Niels Ferguson about it and never heard back; it fell off the to-do list while I was waiting.) The intended target was IPv6 address hashing for sequence number randomization, but it could be used for pool hashing, too. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: drivers/char/random.c: More futzing about
On 06/11/2014 01:41 PM, H. Peter Anvin wrote: > On 06/11/2014 12:25 PM, Theodore Ts'o wrote: >> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: >>> While talking about performance, I did a quick prototype of random using >>> Skein instead of SHA-1, and it was measurably faster, in part because >>> Skein produces more output per hash. >> >> Which Skein parameters did you use, and how much stack space was >> required for it? Skein-512 is described as needing 200 bytes of >> state, IIRC (which I assume most of which comes from Threefish key >> schedule). >> > > I believe I used Skein-256, but I'd have to dig to find it again. > > -hpa > Sadly I can't find the tree, but I'm 94% sure it was Skein-256 (specifically the SHA3-256 candidate parameter set.) -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: drivers/char/random.c: More futzing about
> ... but how did you measure the "2/3 the time"? I've done some > measurements, using both "time calling fast_mix() and fast_mix2() N > times and divide by N (where N needs to be quite large). Using that > metric, fast_mix2() takes seven times as long to run. Wow, *massively* different results. Using a large iteration count, it's defintiely faster. Here's with 1,000,000 iterations: # ./perftest ./random pool 1 = 6b469698 596ceb8f 70536d2d e262b8ed pool 2 = 72e2554d fd20e020 a51baf43 19472f63 0: 40587696 19952756 (-20634940) 1: 40569444 19986700 (-20582744) 2: 40529396 19983108 (-20546288) 3: 40492468 19959528 (-20532940) 4: 40461808 19977316 (-20484492) 5: 40421980 19977436 (-20444544) 6: 40495440 19959904 (-20535536) 7: 40436676 19975400 (-20461276) 8: 40454912 19971360 (-20483552) 9: 40463260 19957324 (-20505936) Here's with 1 iteration (which you're very right, I failed to test, and instruction counts matter more when code is not hot in cache) # ./perftest ./random pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 76 84 (+8) 1: 36 36 (+0) 2: 32 36 (+4) 3: 32 40 (+8) 4: 28 40 (+12) 5: 32 36 (+4) 6: 32 36 (+4) 7: 28 40 (+12) 8: 32 40 (+8) 9: 28 40 (+12) Comparable, but slightly slower. Clearly, I need to do better. And you can see the first-iteration effects clearly. Still, noting *remotely* like 7x! That seems crazy, as the overall operation counts are not that dissimilar. Here's the "perftest" shell wrapper: #!/bin/sh old="`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`" for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $i done "$@" for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "$old" > $i done And "random.c" is appended. Apologies for the comemnts and #ifdefs I used while experimenting; I wanted to give you *exactly* what I'm running. Do you see a bug in it anywhere? I know P4s have slow rotates, so the larger number of them compared to shifts in the original will definitely cause a hit. But neither of us are using that. Compile line and platform: cc -W -Wall -m64 -O -march=nativerandom.c -o random model name : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz #include typedef uint32_t u32; static u32 const twist_table[8] = { 0x, 0x3b6e20c8, 0x76dc4190, 0x4db26158, 0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 }; struct fast_pool { u32 pool[4]; unsigned long last; unsigned short count; unsigned char rotate; unsigned char last_timer_intr; }; static inline u32 rol32(u32 word, unsigned int shift) { return (word << shift) | (word >> (32 - shift)); } /* * This is a fast mixing routine used by the interrupt randomness * collector. It's hardcoded for an 128 bit pool and assumes that any * locks that might be needed are taken by the caller. */ void fast_mix(struct fast_pool *f, u32 const input[4]) { u32 w; unsignedinput_rotate = f->rotate; w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3]; f->pool[0] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 14) & 31; w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0]; f->pool[1] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1]; f->pool[2] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2]; f->pool[3] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; f->rotate = input_rotate; f->count++; } void fast_mix2(struct fast_pool *f, u32 const input[4]) { /* Copy pool to registers */ u32 a = f->pool[0] ^ input[0]; u32 b = f->pool[1] ^ input[1]; u32 c = f->pool[2] ^ input[2]; u32 d = f->pool[3] ^ input[3]; #if 1 int i; for (i = 0; i < 2; i++) { /* * Inspired by ChaCha QuarterRound, but * modified for much greater parallelism. */ a += b; c += d; d ^= a; b ^= c; // a = rol32(a, 24); c = rol32(c, 5); a = rol32(a, 27); c = rol32(c, 7); // d = rol32(d, 27); b = rol32(b, 7); a += b; c += d; d ^= a; b ^= c; a = rol32(a, 17); c = rol32(c, 11); // d = rol32(d, 17); b = rol32(b, 11); #if 0
Re: drivers/char/random.c: More futzing about
On 06/11/2014 12:25 PM, Theodore Ts'o wrote: > On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: >> While talking about performance, I did a quick prototype of random using >> Skein instead of SHA-1, and it was measurably faster, in part because >> Skein produces more output per hash. > > Which Skein parameters did you use, and how much stack space was > required for it? Skein-512 is described as needing 200 bytes of > state, IIRC (which I assume most of which comes from Threefish key > schedule). > I believe I used Skein-256, but I'd have to dig to find it again. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: drivers/char/random.c: More futzing about
On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: > While talking about performance, I did a quick prototype of random using > Skein instead of SHA-1, and it was measurably faster, in part because > Skein produces more output per hash. Which Skein parameters did you use, and how much stack space was required for it? Skein-512 is described as needing 200 bytes of state, IIRC (which I assume most of which comes from Threefish key schedule). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: drivers/char/random.c: More futzing about
On 06/11/2014 09:38 AM, Theodore Ts'o wrote: > On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote: >> Here's an example of a smaller, faster, and better fast_mix() function. >> The mix is invertible (thus preserving entropy), but causes each input >> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and >> 120 bit0 after 3. > > I've been looking at your fast_mix2(), and it does look interesting. > >> For comparison, with the current linear fast_mix function, bits above >> the 12th in the data words only affect 4 other bits after one repetition. >> >> With 3 iterations, it runs in 2/3 the time of the current fast_mix >> and is half the size: 84 bytes of object code as opposed to 168. > > ... but how did you measure the "2/3 the time"? I've done some > measurements, using both "time calling fast_mix() and fast_mix2() N > times and divide by N (where N needs to be quite large). Using that > metric, fast_mix2() takes seven times as long to run. > > If I only run the two mixing functions once, and use RDTSC to measure > the time, fast_mix2() takes only three times as long. (That's because > the memory cache effects are much less, which favors fast_mix2). > > But either way, fast_mix2() is slower than the current fast_mix(), and > using the measurements that are as advantageous (and most realistic) > that I could come up with, it's still three times slower. > > My measurements were done using Intel 2.8 GHz quad-core i7-4900MQ CPU. > While talking about performance, I did a quick prototype of random using Skein instead of SHA-1, and it was measurably faster, in part because Skein produces more output per hash. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: drivers/char/random.c: More futzing about
On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote: > Here's an example of a smaller, faster, and better fast_mix() function. > The mix is invertible (thus preserving entropy), but causes each input > bit or pair of bits to avalanche to at least 43 bits after 2 rounds and > 120 bit0 after 3. I've been looking at your fast_mix2(), and it does look interesting. > For comparison, with the current linear fast_mix function, bits above > the 12th in the data words only affect 4 other bits after one repetition. > > With 3 iterations, it runs in 2/3 the time of the current fast_mix > and is half the size: 84 bytes of object code as opposed to 168. ... but how did you measure the "2/3 the time"? I've done some measurements, using both "time calling fast_mix() and fast_mix2() N times and divide by N (where N needs to be quite large). Using that metric, fast_mix2() takes seven times as long to run. If I only run the two mixing functions once, and use RDTSC to measure the time, fast_mix2() takes only three times as long. (That's because the memory cache effects are much less, which favors fast_mix2). But either way, fast_mix2() is slower than the current fast_mix(), and using the measurements that are as advantageous (and most realistic) that I could come up with, it's still three times slower. My measurements were done using Intel 2.8 GHz quad-core i7-4900MQ CPU. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/