Re: random: Benchamrking fast_mix2

2014-06-15 Thread Theodore Ts'o
On Sun, Jun 15, 2014 at 02:58:03AM -0400, George Spelvin wrote: > Actually, could you hold off merging that for a bit? Or if you really > want to, add a comment that the rotate constants are preliminary and > will be further optimized. > > I posted that as WIP, just as evidence that something wit

Re: random: Benchamrking fast_mix2

2014-06-14 Thread George Spelvin
Actually, could you hold off merging that for a bit? Or if you really want to, add a comment that the rotate constants are preliminary and will be further optimized. I posted that as WIP, just as evidence that something with better avalanche could be as fast or faster, and I'd like to go back and

Re: random: Benchamrking fast_mix2

2014-06-14 Thread Theodore Ts'o
On Sat, Jun 14, 2014 at 08:23:33PM -0400, George Spelvin wrote: > The example I posted: > > // (29/66353) score = 49/121/123: 6 27 16 14 > > a += b; c += d; > b = rol32(a, 6);d = rol32(c, 27); > d ^= a; b ^= c; > > a += b;

Re: random: Benchamrking fast_mix2

2014-06-14 Thread George Spelvin
> I'll need to do a bit more looking to convince myself that 2 > iterations is better from a mixing perspective, but this looks like it > might be a promising replacement for the 32-bit mixer. This was tested with a hacked-up copy of Bob Jenkins' testing code. It tests a few random starting value

Re: random: Benchamrking fast_mix2

2014-06-14 Thread Theodore Ts'o
OK, using your averaging scheme, on a 32-bit KVM kernel, running at idle, here are my quick results: Original 48028 51419 46021 50065 44750 49231 Fastmix 2, 3 interations 95956 58313 97295 57599 97242 56942 Fastmix 2, 2 iterations 68998 41496 68940 41471 68619 41576 Fastmix 2, 2 iteratio

Re: random: Benchamrking fast_mix2

2014-06-14 Thread George Spelvin
Unfortunately, my test suite is not particularly lightweight. Here are some figures (idle at full speed, and doing a parallel kernel compile) for a 2.5 GHz Phenom. Quite different numbers compared to the Ivy Bridge. When the processor is bust, fast_mix takes longer, instead. And even the rolled

Re: random: Benchamrking fast_mix2

2014-06-14 Thread George Spelvin
And I have of course embarrassed myself publicly by getting the sign wrong. That's what I get for posting *before* booting the result. You may now point and bray like a donkey. :-) Anyway. the following actually works: #if ADD_INTERRUPT_BENCH static unsigned long avg_cycles, avg_deviation; #d

Re: random: Benchamrking fast_mix2

2014-06-14 Thread George Spelvin
> Want to give this patch a try on your collection of machines? Not until I fix it to +#if ADD_INTERRUPT_BENCH +static unsigned long avg_cycles; + +#define AVG_SHIFT 8/* Exponential average factor k=1/256 */ +#define FIXED_1_2 (1 << (AVG_SHIFT-1)) + +static void add_interrupt_bench(cycles_t

Re: random: Benchamrking fast_mix2

2014-06-13 Thread Theodore Ts'o
On Sat, Jun 14, 2014 at 01:25:27AM -0400, George Spelvin wrote: > > When I did a quick comparison of your 64-bit fast_mix2 variant, it's > > much slower than either the 32-bit fast_mix2, or the original fast_mix > > alrogithm. > > That is f***ing *bizarre*. For me, it's *significantly* faster. >

Re: random: Benchamrking fast_mix2

2014-06-13 Thread Theodore Ts'o
OK, so I normally do my testing using 32-bit kernels under KVM. On a i386 kernel, running under kvm (again using a Lenovo T540 with a i7-4900MQ CPU), with the original fast_mix, I get a timestamp count of 166 cycles using a weighted smoothed average. Using your fast_mix2 I get around 450 cycles.

Re: random: Benchamrking fast_mix2

2014-06-13 Thread George Spelvin
> At least for Intel, between its branch predictor and speculative > execution engine, it doesn't make a difference. *Sigh*. We need live measurement. My testing (in your test harness!) showed a noticeable (~10%) speedup. > When I did a quick comparison of your 64-bit fast_mix2 variant, it's

Re: random: Benchamrking fast_mix2

2014-06-13 Thread Theodore Ts'o
On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote: > > Unrolling doesn't make much difference; which isn't surprising given > > that almost all of the differences go away when I commented out the > > udelay(). Basically, at this point what we're primarily measuring is > > how good var

Re: random: Benchamrking fast_mix2

2014-06-13 Thread George Spelvin
> Unrolling doesn't make much difference; which isn't surprising given > that almost all of the differences go away when I commented out the > udelay(). Basically, at this point what we're primarily measuring is > how good various CPU's caches work, especially across context switches > where other

Re: random: Benchamrking fast_mix2

2014-06-13 Thread Theodore Ts'o
On Thu, Jun 12, 2014 at 08:23:04PM -0400, George Spelvin wrote: > Another cache we might be hitting is the branch predictor. Could you try > unrolling fast_mix2 and fast_mix4 and see what difference that makes? > (I'd send you a patch but you could probably do it by hand faster than > appying one.

Re: random: Benchamrking fast_mix2

2014-06-12 Thread George Spelvin
> So I just tried your modified 32-bit mixing function where you the > rotation to the middle step instead of the last step. With the > usleep(), it doesn't make any difference: > > # schedtool -R -p 1 -e /tmp/fast_mix2_48 > fast_mix: 212 fast_mix2: 400 fast_mix3: 400 > fast_mix: 208 fast_mix2:

Re: random: Benchamrking fast_mix2

2014-06-12 Thread Theodore Ts'o
So I just tried your modified 32-bit mixing function where you the rotation to the middle step instead of the last step. With the usleep(), it doesn't make any difference: # schedtool -R -p 1 -e /tmp/fast_mix2_48 fast_mix: 212 fast_mix2: 400 fast_mix3: 400 fast_mix: 208 fast_mix2: 408 fast_

Re: random: Benchamrking fast_mix2

2014-06-12 Thread Theodore Ts'o
One of the reasons for the timing instability is because of the usleep() calls. This allows some other process to schedule, and thus disrupts the I-cache and uop caches. With the usleep() calls: i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49 fast_mix: 213fast_mix2: 280 fast_mix: 336

Re: random: Benchamrking fast_mix2

2014-06-12 Thread George Spelvin
Ted, just as an example of a possible tweak, the following change seems to sightly improve speed without reducing diffusion. (It now looks a bit more like the Skein mix() function.) I'll look for more even more efficient kernels. It appears that the inital XOR is costing a lot; 8 memory fetches

random: Benchamrking fast_mix2

2014-06-11 Thread George Spelvin
> I redid my numbers, and I can no longer reproduce the 7x slowdown. I > do see that if you compile w/o -O2, fast_mix2 is twice as slow. But > it's not 7x slower. For my single-round, I needed to drop to 2 loops rather than 3 to match the speed. That's in the source I posted, but I didn't point