Re: drivers/char/random.c: More futzing about

2014-06-11 Thread George Spelvin
Just to add to my total confusion about the totally disparate performance
numbers we're seeing, I did some benchmarks on other machines.

The speedup isn't as good one-pass as it is iterated, and as I mentioned
it's slower on a P4, but it's not 7 times slower by any stretch.

There are all 1-iteration numbers, run immediately after scp-ing the
binary to the machine so there's no possibility if anything being cached.

(The "64" and "32" versions are compiled -m32 and -m64,
of course.)

2.5 GHz Phenom 9850:

$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0:199142 (-57)
 1:104 95 (-9)
 2:104110 (+6)
 3:103109 (+6)
 4:105 89 (-16)
 5:103 88 (-15)
 6:104 89 (-15)
 7:104 95 (-9)
 8:105 85 (-20)
 9:105 85 (-20)
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0:324147 (-177)
 1:100 86 (-14)
 2:100 99 (-1)
 3:100 88 (-12)
 4:100 86 (-14)
 5:100 86 (-14)
 6:100 89 (-11)
 7:100111 (+11)
 8:100111 (+11)
 9:100 88 (-12)
$ /tmp/random64 1
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
 0: 554788 220327 (-334461)
 1: 554825 220176 (-334649)
 2: 553505 220148 (-57)
 3: 554661 220064 (-334597)
 4: 569559 220064 (-349495)
 5: 612798 220065 (-392733)
 6: 570287 220064 (-350223)
 7: 554790 220064 (-334726)
 8: 554715 220065 (-334650)
 9: 569840 220064 (-349776)
$ /tmp/random32 1
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
 0: 520117 280225 (-239892)
 1: 520125 280154 (-239971)
 2: 520104 280094 (-240010)
 3: 520079 280060 (-240019)
 4: 520069 280060 (-240009)
 5: 520060 280060 (-24)
 6: 558971 280060 (-278911)
 7: 520102 280060 (-240042)
 8: 520082 280060 (-240022)
 9: 520058 280060 (-239998)


3 GHz i5-3330:
$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0: 78 75 (-3)
 1: 36 33 (-3)
 2: 33 39 (+6)
 3: 36 30 (-6)
 4: 36 33 (-3)
 5: 30 33 (+3)
 6: 30 54 (+24)
 7: 24 48 (+24)
 8: 27 33 (+6)
 9: 30 33 (+3)
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0: 66 78 (+12)
 1: 39 39 (+0)
 2: 36 39 (+3)
 3: 45 33 (-12)
 4: 42 33 (-9)
 5: 33 42 (+9)
 6: 45 33 (-12)
 7: 39 36 (-3)
 8:105 48 (-57)
 9: 42 39 (-3)
$ /tmp/random64 1
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
 0: 406188 218104 (-188084)
 1: 402620 246968 (-155652)
 2: 402652 239840 (-162812)
 3: 402720 200312 (-202408)
 4: 402584 200080 (-202504)
 5: 447488 200228 (-247260)
 6: 402788 200312 (-202476)
 7: 402688 200080 (-202608)
 8: 427140 224320 (-202820)
 9: 402576 200080 (-202496)
$ /tmp/random32 1
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
 0: 406485 266670 (-139815)
 1: 392694 266463 (-126231)
 2: 392496 266763 (-125733)
 3: 426003 266145 (-159858)
 4: 392688 27 (-126021)
 5: 432231 266589 (-165642)
 6: 392754 298734 (-94020)
 7: 392883 284994 (-107889)
 8: 392637 266694 (-125943)
 9: 392985 267024 (-125961)


3.5 GHz i7-2700:
# /tmp/perftest /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0: 82 90 (+8)
 1: 38 41 (+3)
 2: 46 38 (-8)
 3: 35 41 (+6)
 4: 46 41 (-5)
 5: 38 38 (+0)
 6: 41 55 (+14)
 7: 41 35 (-6)
 8: 46 24 (-22)
 9: 35 38 (+3)
# /tmp/perftest /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0: 82 76 (-6)
 1: 32 53 (+21)
 2: 49 44 (-5)
 3: 35 41 (+6)
 4: 46 35 (-11)
 5: 35 44 (+9)
 6: 49 50 (+1)
 7: 41 41 (+0)
 8: 32 44 (+12)
 9: 49 44 (-5)
# /tmp/perfte

Re: drivers/char/random.c: More futzing about

2014-06-11 Thread Theodore Ts'o
On Wed, Jun 11, 2014 at 08:32:49PM -0400, George Spelvin wrote:
> Comparable, but slightly slower.  Clearly, I need to do better.
> And you can see the first-iteration effects clearly.  Still,
> noting *remotely* like 7x!

I redid my numbers, and I can no longer reproduce the 7x slowdown.  I
do see that if you compile w/o -O2, fast_mix2 is twice as slow.  But
it's not 7x slower.

When compiling w/o -O2:

   fast_mix fast_mix2
task-clock 221.3 ms 460.7 ms

When compiling with -O2 -Os:

   fast_mix fast_mix2
task-clock 115.4 ms 71.5 ms

And here's the numbers I got with a single iteration using rdtsc:

fast_mix: 164   fast_mix2: 237
fast_mix: 168   fast_mix2: 230
fast_mix: 166   fast_mix2: 228
fast_mix: 164   fast_mix2: 230
fast_mix: 166   fast_mix2: 230
fast_mix: 168   fast_mix2: 232
fast_mix: 166   fast_mix2: 228
fast_mix: 164   fast_mix2: 228
fast_mix: 166   fast_mix2: 234
fast_mix: 166   fast_mix2: 230

- Ted


#include 
#include 
#include 
#include 

typedef unsigned int __u32;

struct fast_pool {
__u32   pool[4];
unsigned long   last;
unsigned short  count;
unsigned char   rotate;
unsigned char   last_timer_intr;
};


/**
 * rol32 - rotate a 32-bit value left
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}

static __u32 const twist_table[8] = {
0x, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

/*
 * This is a fast mixing routine used by the interrupt randomness
 * collector.  It's hardcoded for an 128 bit pool and assumes that any
 * locks that might be needed are taken by the caller.
 */
extern void fast_mix(struct fast_pool *f, __u32 input[4])
{
__u32   w;
unsignedinput_rotate = f->rotate;

w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
f->pool[0] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 14) & 31;
w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
f->pool[1] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
f->pool[2] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
f->pool[3] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
f->count++;
}

extern fast_mix2(struct fast_pool *f, __u32 const input[4])
{
__u32 a = f->pool[0] ^ input[0],  b = f->pool[1] ^ input[1];
__u32 c = f->pool[2] ^ input[2],  d = f->pool[3] ^ input[3];
int i;


for (i = 0; i < 3; i++) {
/*
 * Inspired by ChaCha's QuarterRound, but
 * modified for much greater parallelism.
 * Surprisingly, rotating a and c seems to work
 * better than b and d.  And it runs faster.
 */
a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 15);   c = rol32(c, 21);

a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 3);c = rol32(c, 7);
}
f->pool[0] = a;  f->pool[1] = b;
f->pool[2] = c;  f->pool[3] = d;
f->count++;
}

static __inline__ volatile unsigned long long rdtsc(void)
{
  unsigned long long int x;
 __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
 return x;
}

int main(int argc, char **argv)
{
struct fast_pool f;
int i;
__u32 input[4];
unsigned volatile long long start_time, end_time;

memset(&f, 0, sizeof(f));
memset(&input, 0, sizeof(input));
f.pool[0] = 1;

#if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2)
for (i=0; i < 10; i++) {
usleep(5);
start_time = rdtsc();
fast_mix(&f, input);
end_time = rdtsc();
printf("fast_mix: %llu\t", end_time - start_time);
usleep(5);
start_time = rdtsc();
fast_mix2(&f, input);
end_time = rdtsc();
printf("fast_mix2: %llu\n", end_time - start_time);
}

#endif

#ifdef BENCH_FASTMIX
for (i=0; i < 1024; i++) {
fast_mix(&f, input);
}
#endif

#ifdef BENCH_FASTMIX2
for (i=0; i < 1024; i++) {
fast_mix2(&f, input);
}
#endif
}
--
To unsubscribe from this list: 

Re: drivers/char/random.c: More futzing about

2014-06-11 Thread George Spelvin
> Sadly I can't find the tree, but I'm 94% sure it was Skein-256
> (specifically the SHA3-256 candidate parameter set.)

It would be nice to have two hash functions, optimized separately for 32-
and 64-bit processors.  As the Skein report says, the algorithm can
be adapted to 32 bits easily enough.

I also did some work a while ago to adapt the Skein parameter search
code to develop a Skein-192 (6x32 bits) that would fit into registers
on x86-32.  (It got stalled when I e-mailed Niels Ferguson about it
and never heard back; it fell off the to-do list while I was waiting.)

The intended target was IPv6 address hashing for sequence number
randomization, but it could be used for pool hashing, too.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: drivers/char/random.c: More futzing about

2014-06-11 Thread H. Peter Anvin
On 06/11/2014 01:41 PM, H. Peter Anvin wrote:
> On 06/11/2014 12:25 PM, Theodore Ts'o wrote:
>> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
>>> While talking about performance, I did a quick prototype of random using
>>> Skein instead of SHA-1, and it was measurably faster, in part because
>>> Skein produces more output per hash.
>>
>> Which Skein parameters did you use, and how much stack space was
>> required for it?  Skein-512 is described as needing 200 bytes of
>> state, IIRC (which I assume most of which comes from Threefish key
>> schedule). 
>>
> 
> I believe I used Skein-256, but I'd have to dig to find it again.
> 
>   -hpa
> 

Sadly I can't find the tree, but I'm 94% sure it was Skein-256
(specifically the SHA3-256 candidate parameter set.)

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: drivers/char/random.c: More futzing about

2014-06-11 Thread George Spelvin
> ... but how did you measure the "2/3 the time"?  I've done some
> measurements, using both "time calling fast_mix() and fast_mix2() N
> times and divide by N (where N needs to be quite large).  Using that
> metric, fast_mix2() takes seven times as long to run.

Wow, *massively* different results.  Using a large iteration count,
it's defintiely faster.  Here's with 1,000,000 iterations:

# ./perftest ./random
pool 1 = 6b469698 596ceb8f 70536d2d e262b8ed
pool 2 = 72e2554d fd20e020 a51baf43 19472f63
 0:   40587696   19952756 (-20634940)
 1:   40569444   19986700 (-20582744)
 2:   40529396   19983108 (-20546288)
 3:   40492468   19959528 (-20532940)
 4:   40461808   19977316 (-20484492)
 5:   40421980   19977436 (-20444544)
 6:   40495440   19959904 (-20535536)
 7:   40436676   19975400 (-20461276)
 8:   40454912   19971360 (-20483552)
 9:   40463260   19957324 (-20505936)

Here's with 1 iteration (which you're very right, I failed to test,
and instruction counts matter more when code is not hot in cache)

# ./perftest ./random
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
 0: 76 84 (+8)
 1: 36 36 (+0)
 2: 32 36 (+4)
 3: 32 40 (+8)
 4: 28 40 (+12)
 5: 32 36 (+4)
 6: 32 36 (+4)
 7: 28 40 (+12)
 8: 32 40 (+8)
 9: 28 40 (+12)

Comparable, but slightly slower.  Clearly, I need to do better.
And you can see the first-iteration effects clearly.  Still,
noting *remotely* like 7x!

That seems crazy, as the overall operation counts are
not that dissimilar.


Here's the "perftest" shell wrapper:
#!/bin/sh

old="`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`"

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance > $i
done

"$@"

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo "$old" > $i
done


And "random.c" is appended.  Apologies for the comemnts and #ifdefs
I used while experimenting; I wanted to give you *exactly* what
I'm running.  Do you see a bug in it anywhere?

I know P4s have slow rotates, so the larger number of them compared
to shifts in the original will definitely cause a hit.  But neither
of us are using that.

Compile line and platform:
cc -W -Wall -m64 -O -march=nativerandom.c   -o random
model name  : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz




#include 
typedef uint32_t u32;

static u32 const twist_table[8] = {
0x, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

struct fast_pool {
u32 pool[4];
unsigned long   last;
unsigned short  count;
unsigned char   rotate;
unsigned char   last_timer_intr;
};

static inline u32 rol32(u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}


/*
 * This is a fast mixing routine used by the interrupt randomness
 * collector.  It's hardcoded for an 128 bit pool and assumes that any
 * locks that might be needed are taken by the caller.
 */
void fast_mix(struct fast_pool *f, u32 const input[4])
{
u32 w;
unsignedinput_rotate = f->rotate;

w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
f->pool[0] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 14) & 31;
w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
f->pool[1] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
f->pool[2] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
f->pool[3] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
f->count++;
}

void fast_mix2(struct fast_pool *f, u32 const input[4])
{
/* Copy pool to registers */
u32 a = f->pool[0] ^ input[0];
u32 b = f->pool[1] ^ input[1];
u32 c = f->pool[2] ^ input[2];
u32 d = f->pool[3] ^ input[3];
#if 1
int i;

for (i = 0; i < 2; i++) {
/*
 * Inspired by ChaCha QuarterRound, but
 * modified for much greater parallelism.
 */
a += b; c += d;
d ^= a; b ^= c;
//  a = rol32(a, 24);   c = rol32(c, 5);
a = rol32(a, 27);   c = rol32(c, 7);
//  d = rol32(d, 27);   b = rol32(b, 7);

a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 17);   c = rol32(c, 11);
//  d = rol32(d, 17);   b = rol32(b, 11);
#if 0
   

Re: drivers/char/random.c: More futzing about

2014-06-11 Thread H. Peter Anvin
On 06/11/2014 12:25 PM, Theodore Ts'o wrote:
> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
>> While talking about performance, I did a quick prototype of random using
>> Skein instead of SHA-1, and it was measurably faster, in part because
>> Skein produces more output per hash.
> 
> Which Skein parameters did you use, and how much stack space was
> required for it?  Skein-512 is described as needing 200 bytes of
> state, IIRC (which I assume most of which comes from Threefish key
> schedule). 
> 

I believe I used Skein-256, but I'd have to dig to find it again.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: drivers/char/random.c: More futzing about

2014-06-11 Thread Theodore Ts'o
On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
> While talking about performance, I did a quick prototype of random using
> Skein instead of SHA-1, and it was measurably faster, in part because
> Skein produces more output per hash.

Which Skein parameters did you use, and how much stack space was
required for it?  Skein-512 is described as needing 200 bytes of
state, IIRC (which I assume most of which comes from Threefish key
schedule). 

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: drivers/char/random.c: More futzing about

2014-06-11 Thread H. Peter Anvin
On 06/11/2014 09:38 AM, Theodore Ts'o wrote:
> On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote:
>> Here's an example of a smaller, faster, and better fast_mix() function.
>> The mix is invertible (thus preserving entropy), but causes each input
>> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and
>> 120 bit0 after 3.
> 
> I've been looking at your fast_mix2(), and it does look interesting.
> 
>> For comparison, with the current linear fast_mix function, bits above
>> the 12th in the data words only affect 4 other bits after one repetition.
>>
>> With 3 iterations, it runs in 2/3 the time of the current fast_mix
>> and is half the size: 84 bytes of object code as opposed to 168.
> 
> ... but how did you measure the "2/3 the time"?  I've done some
> measurements, using both "time calling fast_mix() and fast_mix2() N
> times and divide by N (where N needs to be quite large).  Using that
> metric, fast_mix2() takes seven times as long to run.
> 
> If I only run the two mixing functions once, and use RDTSC to measure
> the time, fast_mix2() takes only three times as long.  (That's because
> the memory cache effects are much less, which favors fast_mix2).
> 
> But either way, fast_mix2() is slower than the current fast_mix(), and
> using the measurements that are as advantageous (and most realistic)
> that I could come up with, it's still three times slower.
> 
> My measurements were done using Intel 2.8 GHz quad-core i7-4900MQ CPU.
> 

While talking about performance, I did a quick prototype of random using
Skein instead of SHA-1, and it was measurably faster, in part because
Skein produces more output per hash.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: drivers/char/random.c: More futzing about

2014-06-11 Thread Theodore Ts'o
On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote:
> Here's an example of a smaller, faster, and better fast_mix() function.
> The mix is invertible (thus preserving entropy), but causes each input
> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and
> 120 bit0 after 3.

I've been looking at your fast_mix2(), and it does look interesting.

> For comparison, with the current linear fast_mix function, bits above
> the 12th in the data words only affect 4 other bits after one repetition.
> 
> With 3 iterations, it runs in 2/3 the time of the current fast_mix
> and is half the size: 84 bytes of object code as opposed to 168.

... but how did you measure the "2/3 the time"?  I've done some
measurements, using both "time calling fast_mix() and fast_mix2() N
times and divide by N (where N needs to be quite large).  Using that
metric, fast_mix2() takes seven times as long to run.

If I only run the two mixing functions once, and use RDTSC to measure
the time, fast_mix2() takes only three times as long.  (That's because
the memory cache effects are much less, which favors fast_mix2).

But either way, fast_mix2() is slower than the current fast_mix(), and
using the measurements that are as advantageous (and most realistic)
that I could come up with, it's still three times slower.

My measurements were done using Intel 2.8 GHz quad-core i7-4900MQ CPU.

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/