So I just tried your modified 32-bit mixing function, where you moved
the rotation to the middle step instead of the last step.  With the
usleep(), it doesn't make any difference:

# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 212  fast_mix2: 400   fast_mix3: 400
fast_mix: 208  fast_mix2: 408   fast_mix3: 388
fast_mix: 208  fast_mix2: 396   fast_mix3: 404
fast_mix: 224  fast_mix2: 408   fast_mix3: 392
fast_mix: 200  fast_mix2: 404   fast_mix3: 404
fast_mix: 208  fast_mix2: 412   fast_mix3: 396
fast_mix: 208  fast_mix2: 392   fast_mix3: 392
fast_mix: 212  fast_mix2: 408   fast_mix3: 388
fast_mix: 200  fast_mix2: 716   fast_mix3: 773
fast_mix: 426  fast_mix2: 717   fast_mix3: 728

Without the usleep() I get:

692# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 104  fast_mix2: 224   fast_mix3: 176  
fast_mix: 56   fast_mix2: 112   fast_mix3: 56   
fast_mix: 56   fast_mix2: 64    fast_mix3: 64   
fast_mix: 64   fast_mix2: 64    fast_mix3: 48   
fast_mix: 56   fast_mix2: 64    fast_mix3: 56   
fast_mix: 56   fast_mix2: 64    fast_mix3: 64   
fast_mix: 56   fast_mix2: 64    fast_mix3: 64   
fast_mix: 56   fast_mix2: 72    fast_mix3: 56   
fast_mix: 56   fast_mix2: 64    fast_mix3: 56   
fast_mix: 64   fast_mix2: 64    fast_mix3: 56   

I'm beginning to suspect that some of the differences between your
measurements and mine come from more than just the smaller cache (8M
instead of 12M).  There are probably some other caches, perhaps the
uop cache, which are also smaller on the mobile processor, and that
would explain why you are seeing somewhat different results.

> 
> Of course, using wider words works fantastically.
> These constants give 76 bits of avalanche after 2 rounds,
> essentially full after 3....

And here is my testing using your 64-bit variant:

# schedtool -R -p 1 -e /tmp/fast_mix2_49
fast_mix: 294  fast_mix2: 476   fast_mix4: 442
fast_mix: 286  fast_mix2: 1058  fast_mix4: 448
fast_mix: 958  fast_mix2: 460   fast_mix4: 1002
fast_mix: 940  fast_mix2: 1176  fast_mix4: 826
fast_mix: 476  fast_mix2: 840   fast_mix4: 826
fast_mix: 462  fast_mix2: 840   fast_mix4: 826
fast_mix: 462  fast_mix2: 826   fast_mix4: 826
fast_mix: 462  fast_mix2: 826   fast_mix4: 826
fast_mix: 462  fast_mix2: 826   fast_mix4: 826
fast_mix: 462  fast_mix2: 840   fast_mix4: 826

... and without usleep()

690# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 52   fast_mix2: 116   fast_mix4: 96   
fast_mix: 32   fast_mix2: 32    fast_mix4: 24   
fast_mix: 28   fast_mix2: 36    fast_mix4: 24   
fast_mix: 32   fast_mix2: 32    fast_mix4: 24   
fast_mix: 32   fast_mix2: 36    fast_mix4: 24   
fast_mix: 36   fast_mix2: 32    fast_mix4: 24   
fast_mix: 32   fast_mix2: 36    fast_mix4: 28   
fast_mix: 28   fast_mix2: 28    fast_mix4: 24   
fast_mix: 32   fast_mix2: 36    fast_mix4: 28   
fast_mix: 32   fast_mix2: 32    fast_mix4: 24   

The bottom line is that what we are primarily measuring here is a
variety of cache effects, and these are going to be quite different
on different microarchitectures.
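
To make the "with usleep()" versus "without usleep()" distinction
concrete, here is a minimal sketch of that kind of harness.  It is not
the test program used for the numbers above: demo_mix(), time_one_call()
and now_ns() are hypothetical names, the mixing step is a placeholder,
and it reports nanoseconds from clock_gettime() rather than cycle
counts.  It uses nanosleep() in place of usleep().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical function-under-test signature; not the kernel's fast_mix(). */
typedef void (*mix_fn)(uint32_t pool[4], const uint32_t in[4]);

/* Rotate left; callers must pass a shift in the range 1..31. */
static inline uint32_t rol32(uint32_t w, unsigned int s)
{
    return (w << s) | (w >> (32 - s));
}

/* Placeholder mixing step so the harness has something to time. */
void demo_mix(uint32_t pool[4], const uint32_t in[4])
{
    for (int i = 0; i < 4; i++) {
        pool[i] ^= in[i];
        pool[i] = rol32(pool[i], 7) + pool[(i + 1) & 3];
    }
}

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Time a single call to fn.  If sleep_ns is nonzero, sleep first so
 * that other work has a chance to push the code and data out of the
 * caches; with sleep_ns == 0, back-to-back calls measure the
 * warm-cache case instead.
 */
uint64_t time_one_call(mix_fn fn, uint32_t pool[4], const uint32_t in[4],
                       long sleep_ns)
{
    uint64_t t0, t1;

    assert(sleep_ns >= 0 && sleep_ns < 1000000000L);
    if (sleep_ns > 0) {
        struct timespec req = { .tv_sec = 0, .tv_nsec = sleep_ns };

        nanosleep(&req, NULL);
    }
    t0 = now_ns();
    fn(pool, in);
    t1 = now_ns();
    return t1 - t0;
}
```

Calling time_one_call() in a loop with sleep_ns set to zero and then
again with a nonzero value gives the two kinds of columns shown above;
the gap between them is mostly the cost of refilling caches, not the
cost of the arithmetic.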

That being said, I wouldn't be at all surprised if there are some
CPUs where the extra memory dereference to the twist_table[] would
definitely hurt, since Intel's amazing cache architecture(tm) is no
doubt covering a lot of sins.  I wouldn't be at all surprised if some
of these new mixing functions would fare much better if we tried
benchmarking them on a 32-bit ARM processor, for example....
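
For readers who want to see what the "extra memory dereference" refers
to, here is a rough sketch contrasting a table-based step with a
register-only one.  The table and the shift-and-xor form follow the
twist_table[] use in drivers/char/random.c as I understand it;
step_twist() and step_arx() are hypothetical names, and the rotation
constant in step_arx() is an arbitrary illustration, not one of the
proposed functions benchmarked above.

```c
#include <stdint.h>

/* Twist table as used by the random driver's pool mixing. */
static const uint32_t twist_table[8] = {
    0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
    0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278
};

/* Rotate left; callers must pass a shift in the range 1..31. */
static inline uint32_t rol32(uint32_t w, unsigned int s)
{
    return (w << s) | (w >> (32 - s));
}

/*
 * Table-based step: the result depends on a load from twist_table[],
 * indexed by the low three bits of w.  That load is the extra memory
 * dereference; it is cheap when the table is in L1 and costly when
 * it is not.
 */
uint32_t step_twist(uint32_t w)
{
    return (w >> 3) ^ twist_table[w & 7];
}

/*
 * Register-only step (add, rotate, xor) with an arbitrary rotation
 * of 7.  No data memory is touched, so its cost depends far less on
 * the state of the data cache.
 */
uint32_t step_arx(uint32_t a, uint32_t b)
{
    a += b;
    return rol32(a, 7) ^ b;
}
```

On a machine with a large, fast cache the two steps can time about the
same, which is consistent with the numbers above; the difference is
more likely to show up where a miss on the table is expensive.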

                                              - Ted
