cprng_fast and interrupts [was Re: Patch: cprng_fast performance - please review.]

2014-04-22 Thread Taylor R Campbell
   Date: Fri, 18 Apr 2014 21:50:46 -0400
   From: Thor Lancelot Simon t...@panix.com

Are there actually any callers of cprng_fast at IPL_HIGH?  Are there
actually any legitimate random decisions to be made at IPL_HIGH?  I'm
sceptical.

   What do you get if you cross an elephant and a rhinoceros?  Given
   that what we do within the spl* takes very little time, I am inclined
   to say that what spl we go to hardly matters, and to be very
   conservative.  The real question here, I think, is whether we should
   spl*() at all, or forbid use of cprng_fast() from interrupt context
   entirely.

In partial answer to my question, there are calls to libc/libkern
random at IPL_HIGH in various statistical clock interrupt handlers, so
if we want to replace that by cprng_fast we shall need it to work at
IPL_HIGH (which currently it doesn't really!).

There are also many calls to cprng_fast throughout the network stack,
certainly in softint handlers if not also in hard interrupt handlers,
so we need cprng_fast to work at IPL_SOFTNET if not IPL_VM.

Certainly we should not do more than a small constant amount of
computation with interrupts blocked.  For IPL_VM, hundreds or even a
couple thousand cycles of crypto are probably acceptable -- we
already do, e.g., red/black tree operations at IPL_VM.
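
To make that discipline concrete, here's a hedged sketch of what the
small-request path could look like under it -- the names are
hypothetical, not from the patch: raise the spl, do O(1) work against
a buffered keystream, and restore the previous level.

#include <sys/types.h>
#include <sys/intr.h>           /* splvm(), splx() */

/*
 * Hypothetical keystream buffer; in the real thing this would be
 * per-CPU state.  refill_buffer() stands in for one cipher core.
 */
static uint32_t buffered[16];
static u_int nbuffered;
static u_int refill_buffer(uint32_t *);

uint32_t
cprng_fast32_sketch(void)
{
        uint32_t v;
        int s;

        s = splvm();                    /* block interrupts up to IPL_VM */
        if (nbuffered == 0)             /* bounded: at most one core */
                nbuffered = refill_buffer(buffered);
        v = buffered[--nbuffered];      /* O(1): one word from the buffer */
        splx(s);                        /* restore the previous level */
        return v;
}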

But what is the maximum acceptable latency at IPL_HIGH?  Is, say, 400
cycles too long to block interrupts?  Is that too slow for statclock
interrupts?  (Should we have guidelines for these written down?)

Much as I want to discourage the use of non-cryptographic PRNGs for
any purpose, I am willing to entertain the notion that the statclock
interrupt handler may be adequately serviced by an LCG or LFSR too.
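
For the record, the sort of thing that would suffice there -- a hedged
sketch of a 32-bit LCG using the Numerical Recipes constants,
trivially predictable and emphatically not for anything
security-relevant:

#include <sys/types.h>

static uint32_t statclock_seed = 1;     /* hypothetical per-CPU seed */

static inline uint32_t
statclock_lcg(void)
{
        /* x' = a*x + c mod 2^32, with a = 1664525, c = 1013904223 */
        statclock_seed = statclock_seed * 1664525U + 1013904223U;
        return statclock_seed;
}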


Re: Towards design criteria for cprng_fast()

2014-04-22 Thread Taylor R Campbell
   Date: Fri, 18 Apr 2014 16:12:32 -0400
   From: Thor Lancelot Simon t...@panix.com

   With those observations in mind, I offer these design criteria for
   cprng_fast():

   Strength criterion: At the time of the selection of an algorithm
                       for cprng_fast(), there should be no known,
                       practical cryptographic attack which either:

This sets too high a bar for *rejecting* a crypto algorithm -- it
would not reject a freshly baked cipher that nobody has analyzed yet.
Analogous criteria would also fail to reject, e.g., SHA-1, since
nobody has published collisions for it.

We ought to have stringent criteria for *adopting* crypto algorithms,
not for rejecting them.

In general, if we're going to adopt a crypto algorithm, then I think
it had better

(a) have substantial review (more than one or two papers discussing
how immature the cryptanalysis is);
(b) not have warning signs, such as secret-dependent array indices or
branches (see the sketch after this list); and
(c) have positive support from professional crypto wizards (not just
crypto tinkers like me).
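
To illustrate warning sign (b): in code shaped like the following,
which cache line gets touched depends on a secret byte, so the memory
access pattern can leak the secret through cache-timing side channels.
RC4's S-box lookups and HC-128's table indexing have exactly this
shape; the Salsa20/ChaCha core never indexes memory by secret data.

#include <sys/types.h>

uint8_t
leaky_lookup(const uint8_t sbox[256], uint8_t secret)
{
        return sbox[secret];    /* array index depends on secret data */
}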

Even if it is not being used for key material, I don't think we ought
to have it lying around tempting anyone to use it for anything else.


Concerning HC-128, I asked Samuel Neves, a crypto researcher, for his
opinion on HC-128 in this context.  He said he supposed it's `ok as a
cipher' but `the secret indexes are a no-no' and `my opinion is that
it would require a damn good rationale to adopt HC-XXX, not the other
way around'.


Re: cprng_fast implementation benchmarks

2014-04-22 Thread Taylor R Campbell
   Date: Sun, 20 Apr 2014 03:18:03 -0400
   From: Thor Lancelot Simon t...@panix.com

   The following tests were run:

   1) Cycles-per-byte, 32-bit request: at initial keying of
  the generator and each automatic rekeying.

   2) Generate 4GB of data by requesting 256 bytes at a time
  from userspace using a small C program that loops around
  the kern.arandom sysctl call.  Values are in seconds.

   3) Generate 16GB of data by running 4 copies of the generator
  program in parallel.  Values are in seconds.
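
For anyone who wants to reproduce test (2): the original program
wasn't posted, but a minimal equivalent (assumed, not Thor's actual
code) just loops pulling 256 bytes at a time out of kern.arandom via
the KERN_ARND sysctl:

#include <sys/param.h>
#include <sys/sysctl.h>

#include <err.h>

int
main(void)
{
        static const int mib[2] = { CTL_KERN, KERN_ARND };
        unsigned char buf[256];
        unsigned long long total = 0;
        const unsigned long long goal = 4ULL << 30;     /* 4 GB */
        size_t len;

        while (total < goal) {
                len = sizeof buf;
                if (sysctl(mib, 2, buf, &len, NULL, 0) == -1)
                        err(1, "sysctl kern.arandom");
                total += len;
        }
        return 0;
}

Timing that with time(1), and running four copies in parallel,
reproduces tests (2) and (3).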

It's good to have these tests confirm that going per-CPU didn't hurt
single-threaded bulk throughput and did improve scalability -- so much
so that just about any halfway reasonable choice of stream cipher
would do better than what we have now or had before.
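
(For anyone following along at home: per-CPU state in the kernel is
usually plumbed through the percpu(9) API.  A hedged sketch of the
access pattern -- not necessarily how the patch structures it, and
with spl protection omitted:)

#include <sys/types.h>
#include <sys/percpu.h>

struct cprng_fast {                     /* per-CPU generator state */
        uint32_t buffer[16];
        u_int have;
};

static percpu_t *cprng_fast_percpu;

void
cprng_fast_init(void)
{
        cprng_fast_percpu = percpu_alloc(sizeof(struct cprng_fast));
}

uint32_t
cprng_fast32(void)
{
        struct cprng_fast *cf;
        uint32_t v;

        cf = percpu_getref(cprng_fast_percpu);  /* pins us to this CPU */
        if (cf->have == 0) {
                /* refill: one cipher core (see the sketch further down) */
                cf->have = 16;
        }
        v = cf->buffer[--cf->have];
        percpu_putref(cprng_fast_percpu);
        return v;
}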

What these tests don't measure is the impact on overall system
performance, especially via the CPU cache.  I don't expect going
per-CPU to hurt cachewise, but the difference between RC4's 256-byte
state and HC-128's 4 KB state may be noticeable in, say, build.sh, or
under high network load triggering cprng_fast calls in ALTQ or IPsec
or something.

   Let's test SALSA20 and discuss further.

I wrote a little code to implement the cprng_fast API using Salsa20 or
ChaCha with a buffer for a single cipher core's output.  The code,
along with a little driver to measure single-threaded speed and test
vectors to ensure we really are computing Salsa20 and ChaCha, is
temporarily at

https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/.

The entire state fits in two 64-byte cache lines -- half what RC4
took.  On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk,
average of ~50 cycles for each cprng_fast32, and maximum of ~300
cycles.

(If you need help reproducing my results, let me know.  This is all a
userland mockup, of course, but adapting it to the real kernel
cprng_fast shouldn't be hard.)

This implementation also has the property that the maximum work it
does with interrupts blocked is a single cipher core and copying at
most 64 bytes, plus some additions/subtractions to count things and
maybe schedule a softint.

It services long requests by grabbing a temporary key from the buffer
while interrupts are blocked, and then generating the rest from the
temporary key, which works well because Salsa20 and ChaCha have zero
key setup cost.
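
To make the scheme concrete, here is a hedged userland sketch of the
buffered-ChaCha idea -- this is not the code at the URL above; names
and layout are illustrative, and byte-order details are glossed over:

#include <stdint.h>
#include <string.h>

#define ROTL32(x, n)    (((x) << (n)) | ((x) >> (32 - (n))))
#define QR(a, b, c, d) do {                                     \
        (a) += (b); (d) ^= (a); (d) = ROTL32((d), 16);          \
        (c) += (d); (b) ^= (c); (b) = ROTL32((b), 12);          \
        (a) += (b); (d) ^= (a); (d) = ROTL32((d), 8);           \
        (c) += (d); (b) ^= (c); (b) = ROTL32((b), 7);           \
} while (0)

struct cprng_fast {
        uint32_t input[16];     /* constants, 256-bit key, counter, nonce */
        uint32_t buffer[16];    /* one 64-byte block of keystream */
        unsigned have;          /* 32-bit words of buffer still unused */
};

/* ChaCha20 core: 10 double rounds, then feed-forward addition. */
static void
chacha_core(uint32_t out[16], const uint32_t in[16])
{
        uint32_t x[16];
        int i;

        memcpy(x, in, sizeof x);
        for (i = 0; i < 20; i += 2) {
                QR(x[0], x[4], x[8], x[12]);    /* column rounds */
                QR(x[1], x[5], x[9], x[13]);
                QR(x[2], x[6], x[10], x[14]);
                QR(x[3], x[7], x[11], x[15]);
                QR(x[0], x[5], x[10], x[15]);   /* diagonal rounds */
                QR(x[1], x[6], x[11], x[12]);
                QR(x[2], x[7], x[8], x[13]);
                QR(x[3], x[4], x[9], x[14]);
        }
        for (i = 0; i < 16; i++)
                out[i] = x[i] + in[i];
}

void
cprng_fast_init(struct cprng_fast *c, const uint8_t key[32])
{
        /* "expand 32-byte k": the standard ChaCha input constants. */
        c->input[0] = 0x61707865; c->input[1] = 0x3320646e;
        c->input[2] = 0x79622d32; c->input[3] = 0x6b206574;
        memcpy(&c->input[4], key, 32);  /* key (endianness glossed over) */
        memset(&c->input[12], 0, 16);   /* 64-bit counter + nonce, all zero */
        c->have = 0;
}

/* All the work ever done with interrupts blocked: one core, one bump. */
static void
cprng_fast_refill(struct cprng_fast *c)
{
        chacha_core(c->buffer, c->input);
        if (++c->input[12] == 0)        /* 64-bit block counter */
                c->input[13]++;
        c->have = 16;
}

uint32_t
cprng_fast32(struct cprng_fast *c)
{
        /* In the kernel this runs at raised spl on per-CPU state. */
        if (c->have == 0)
                cprng_fast_refill(c);
        return c->buffer[--c->have];
}

A long request would, per the above, pull a temporary key out of
buffer inside the critical section and stream the rest from that key
outside it, relying on the ciphers' zero key-setup cost; that path is
omitted here for brevity.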

   We should run some of these benchmarks on big-endian systems and systems
   that are architecturally unlike the OOO x86-64 I used for my tests.

If I find time I can do some tests on some littleish (by modern
standards) arm and powerpc systems, but I'll have to defer to others
for the VAXen and toasters.