Date: Sun, 20 Apr 2014 03:18:03 -0400 From: Thor Lancelot Simon <t...@panix.com>
The following tests were run: 1) Cycles-per-byte, 32 bit request: at initial keying of the generator and each automatic rekeying. 2) Generate 4GB of data by requesting 256 bytes at a time from userspace using a small C program that loops around the kern.arandom sysctl call. Values are in seconds. 3) Generate 16GB of data by running 4 copies of the generator program in parallel. Values are in seconds. It's good to confirm by these tests that going per-CPU didn't hurt single-threaded bulk throughput and improved scalability -- so much that just about any halfway reasonable choice of stream cipher would do better than what we have now or had before. What these tests don't measure is the impact on overall system performance, especially via the CPU cache. I don't expect going per-CPU to hurt cachewise, but the difference between RC4's 256-byte state and HC-128's 4 KB state may be noticeable in, say, build.sh, or perhaps high network load triggering cprng_fasts in ALTQ or IPsec or something. Let's test SALSA20 and discuss further. I wrote a little code to implement the cprng_fast API using Salsa20 or ChaCha with buffer for a single cipher core output. The code, along with a little driver to measure single-threaded speed and test vectors to ensure we are computing Salsa20 and ChaCha, is temporarily at <https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/>. The entire state fits in two 64-byte cache lines -- half what RC4 took. On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk, average of ~50 cycles for each cprng_fast32, and maximum of ~300 cycles. (If you need help reproducing my results, let me know. This is all a userland mockup, of course, but adapting it to the real kernel cprng_fast shouldn't be hard.) This implementation also has the property that the maximum work it does with interrupts blocked is a single cipher core and copying at most 64 bytes, plus some additions/subtractions to count things and maybe schedule a softint. It services long requests by grabbing a temporary key from the buffer while interrupts are blocked, and then generating the rest from the temporary key, which works well because Salsa20 and ChaCha have zero key setup cost. We should run some of these benchmarks on big-endian systems and systems that are architecturally unlike the OOO x86-64 I used for my tests. If I find time I can do some tests on some littleish (by modern standards) arm and powerpc systems, but I'll have to defer to others for the VAXen and toasters.