cprng_fast and interrupts [was Re: Patch: cprng_fast performance - please review.]
Date: Fri, 18 Apr 2014 21:50:46 -0400
From: Thor Lancelot Simon t...@panix.com

Are there actually any callers of cprng_fast at IPL_HIGH? Are there actually any legitimate random decisions to be made at IPL_HIGH? I'm sceptical. Given that what we do within the spl* takes very little time, I am inclined to say that what spl we go to hardly matters, and to be very conservative.

The real question here, I think, is whether we should spl*() at all, or forbid use of cprng_fast() from interrupt context entirely.

In partial answer to my question, there are calls to libc/libkern random at IPL_HIGH in various statistical clock interrupt handlers, so if we want to replace that with cprng_fast we shall need it to work at IPL_HIGH (which currently it doesn't really!). There are also many calls to cprng_fast throughout the network stack, certainly in softint handlers if not also in hard interrupt handlers, so we need cprng_fast to work at IPL_SOFTNET if not IPL_VM.

Certainly we should not do more than a small constant amount of computation with interrupts blocked. For IPL_VM, hundreds or even a couple of thousand cycles of crypto are probably acceptable -- we already do, e.g., red/black tree operations at IPL_VM. But what is the maximum acceptable latency at IPL_HIGH? Is, say, 400 cycles too long to block interrupts? Is that too slow for statclock interrupts? (Should we have guidelines for these written down?)

Much as I want to discourage the use of non-cryptographic PRNGs for any purpose, I am willing to entertain the notion that the statclock interrupt handler may be adequately serviced by an LCG or LFSR too.
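The spl discipline under discussion -- do only a small, constant amount of work between raising and restoring the interrupt priority level -- might be sketched as below. This is a userland mockup, not kernel code: splvm()/splx() are mocked, and all other names (the buffer layout, cipher_core) are hypothetical stand-ins, not the actual cprng_fast implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock spl(9): in the kernel, splvm() raises the IPL and splx()
 * restores the saved level; here we just record a flag. */
static int fake_ipl;
static int splvm(void) { int s = fake_ipl; fake_ipl = 1; return s; }
static void splx(int s) { fake_ipl = s; }

static uint8_t buf[64];   /* one cipher core's worth of output (unkeyed here) */
static size_t bufpos;     /* bytes already consumed from buf */

/* Stand-in for one stream-cipher core call; NOT a real cipher. */
static void cipher_core(uint8_t *out) { memset(out, 0x5a, 64); }

uint32_t cprng_fast32_sketch(void) {
    uint32_t v;
    int s = splvm();                      /* block interrupts briefly */
    if (bufpos + sizeof v > sizeof buf) { /* refill: one core call, no more */
        cipher_core(buf);
        bufpos = 0;
    }
    memcpy(&v, buf + bufpos, sizeof v);   /* small constant copy */
    bufpos += sizeof v;
    splx(s);                              /* restore IPL immediately */
    return v;
}
```

The point of the structure is that the worst case inside the spl window is one core call plus a bounded copy, which is what makes a concrete cycle budget (400 cycles? more?) meaningful to discuss.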
Re: Towards design criteria for cprng_fast()
Date: Fri, 18 Apr 2014 16:12:32 -0400
From: Thor Lancelot Simon t...@panix.com

With those observations in mind, I offer these design criteria for cprng_fast(): Strength criterion: At the time of the selection of an algorithm for cprng_fast(), there should be no known, practical cryptographic attack which either:

This is too stringent a criterion to *reject* a crypto algorithm -- it would not reject a freshly baked cipher nobody has analyzed yet. Analogous criteria would also not reject, e.g., SHA-1, because nobody has published collisions. We ought to have stringent criteria for *adopting* crypto algorithms, not for rejecting them.

In general, if we're going to adopt a crypto algorithm, then I think it had better (a) have substantial review (more than one or two papers discussing how immature the cryptanalysis is); (b) not have warning signs, such as secret-dependent array indices or branches; and (c) have positive support from professional crypto wizards (not just crypto tinkers like me). Even if it is not being used for key material, I don't think we ought to have it lying around tempting anyone to use it for anything else.

Concerning HC-128, I asked Samuel Neves, a crypto researcher, for his opinion on HC-128 in this context. He said he supposed it's `ok as a cipher' but `the secret indexes are a no-no' and `my opinion is that it would require a damn good rationale to adopt HC-XXX, not the other way around'.
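The "secret-dependent array indices" warning sign can be made concrete with a fragment like the following. Table-driven ciphers such as RC4 or HC-128 index memory with secret values, so which cache lines get touched depends on the secret state -- the classic cache-timing side channel. ARX designs (Salsa20/ChaCha) use only additions, rotations, and XORs, whose timing is data-independent on common hardware. This is an illustrative fragment, not code from either cipher.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t sbox[256];  /* stand-in state table, as in RC4/HC-128 */

/* Secret-dependent array index: the memory access pattern -- and so
 * the cache-timing behavior -- depends on `secret'. */
uint8_t leaky_step(uint8_t secret) {
    return sbox[secret];   /* which cache line is touched? ask the secret */
}

/* ARX step: same instructions and same memory traffic for every input. */
uint32_t arx_step(uint32_t x, uint32_t y) {
    x += y;
    return (x << 7) | (x >> 25);   /* constant-time 32-bit rotate by 7 */
}
```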
Re: cprng_fast implementation benchmarks
Date: Sun, 20 Apr 2014 03:18:03 -0400
From: Thor Lancelot Simon t...@panix.com

The following tests were run:

1) Cycles-per-byte, 32-bit request: at initial keying of the generator and at each automatic rekeying.

2) Generate 4GB of data by requesting 256 bytes at a time from userspace, using a small C program that loops around the kern.arandom sysctl call. Values are in seconds.

3) Generate 16GB of data by running 4 copies of the generator program in parallel. Values are in seconds.

It's good to confirm by these tests that going per-CPU didn't hurt single-threaded bulk throughput and improved scalability -- so much so that just about any halfway reasonable choice of stream cipher would do better than what we have now or had before.

What these tests don't measure is the impact on overall system performance, especially via the CPU cache. I don't expect going per-CPU to hurt cachewise, but the difference between RC4's 256-byte state and HC-128's 4 KB state may be noticeable in, say, build.sh, or perhaps under high network load triggering cprng_fasts in ALTQ or IPsec or something. Let's test SALSA20 and discuss further.

I wrote a little code to implement the cprng_fast API using Salsa20 or ChaCha, with a buffer for a single cipher core output. The code, along with a little driver to measure single-threaded speed and test vectors to ensure we are actually computing Salsa20 and ChaCha, is temporarily at https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/.

The entire state fits in two 64-byte cache lines -- half what RC4 took. On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk, an average of ~50 cycles for each cprng_fast32, and a maximum of ~300 cycles. (If you need help reproducing my results, let me know. This is all a userland mockup, of course, but adapting it to the real kernel cprng_fast shouldn't be hard.)
This implementation also has the property that the maximum work it does with interrupts blocked is a single cipher core and a copy of at most 64 bytes, plus some additions and subtractions to count things and maybe the scheduling of a softint. It services long requests by grabbing a temporary key from the buffer while interrupts are blocked, and then generating the rest of the output from the temporary key -- which works well because Salsa20 and ChaCha have zero key setup cost.

We should run some of these benchmarks on big-endian systems and on systems architecturally unlike the out-of-order x86-64 I used for my tests. If I find time I can do some tests on some littleish (by modern standards) arm and powerpc systems, but I'll have to defer to others for the VAXen and toasters.
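The long-request path described above might be sketched as follows. With interrupts blocked, only a 32-byte temporary key is taken from the buffered core output (refilling with at most one core call); the bulk of the request is then generated from that temporary key with interrupts re-enabled. Salsa20/ChaCha make this cheap because keying is just copying the key words into the core's input block. All names here are hypothetical stand-ins, and the `core' function is deliberately NOT a real cipher -- just enough structure to show the shape.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t cbuf[64];   /* buffered output of one cipher core call */
static size_t cpos;        /* bytes already consumed from cbuf */

/* Stand-in for one keyed Salsa20/ChaCha core call producing 64 bytes.
 * NOT a real cipher -- illustrative only. */
static void core(const uint8_t key[32], uint64_t ctr, uint8_t out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = key[i % 32] ^ (uint8_t)(ctr + (uint64_t)i);
}

void cprng_fast_long(uint8_t *out, size_t len) {
    uint8_t tmpkey[32];
    uint64_t ctr = 0;

    /* --- with interrupts blocked: bounded work only --- */
    if (cpos + sizeof tmpkey > sizeof cbuf) {  /* refill: one core call */
        static const uint8_t k0[32];           /* generator's key lives here */
        core(k0, 0, cbuf);
        cpos = 0;
    }
    memcpy(tmpkey, cbuf + cpos, sizeof tmpkey);
    cpos += sizeof tmpkey;
    /* --- interrupts may be re-enabled from here on --- */

    /* Generate the bulk of the request from the temporary key; zero
     * key setup cost makes rekeying per request cheap. */
    while (len > 0) {
        uint8_t block[64];
        size_t n = len < sizeof block ? len : sizeof block;
        core(tmpkey, ctr++, block);
        memcpy(out, block, n);
        out += n;
        len -= n;
    }
    memset(tmpkey, 0, sizeof tmpkey);  /* don't leave key material behind */
}
```

A nice consequence of this split is that the latency bound at raised IPL is independent of the request length, which is exactly the property asked for earlier in the thread.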