Re: cprng_fast implementation benchmarks

2014-04-26 Thread Greg Troxel

After reading all these messages a bit too fast, it seems to be that we
can add to Thor's analysis

  Moving the fast-not-super-strong PRNG to ChaCha8 is clearly a step
  forward from what we have now.  (really this is his conclusion)

  It remains for Someone to do more formal work (perhaps in an academic
  context) to give specifications for good-enough-random, to analayze if
  our implementation meets the specification, and if our uses can
  reasonably rely on them.   (I think the lack of this is the essence of
  what Paul is pointing out, and it's a fair point).

I'll ask around to see if I can find a spare intern.  (That's sort of a
joke but not 100%; this does seem like useful work to do.)


pgp9yfd0JhRlX.pgp
Description: PGP signature


Re: cprng_fast implementation benchmarks

2014-04-26 Thread Paul_Koning

On Apr 25, 2014, at 10:09 PM, Thor Lancelot Simon t...@panix.com wrote:

 On Fri, Apr 25, 2014 at 08:58:54PM +, paul_kon...@dell.com wrote:
 
 But what X are we talking about?  Security analysis does not come from
 generalities, it comes from the point by point analysis of specific
 questions.  You correctly point out that some attacks may not be
 relevant.  Which attacks are not relevant, and which ones are, and why?
 
 I feel like you keep moving the goal posts. 

That was not my intent.  But I can see why it seems that way, and I apologize 
for that.

What’s going on here is that I’m learning about what is being proposed as I 
work through a sequence of messages.  

paul


Re: cprng_fast implementation benchmarks

2014-04-25 Thread Paul_Koning

On Apr 24, 2014, at 10:09 PM, Thor Lancelot Simon t...@panix.com wrote:

 On Thu, Apr 24, 2014 at 07:23:41PM +, paul_kon...@dell.com wrote:
 
 I do disagree.  The reason is that I see no requirements that make it
 possible to decide whether the weak generator is useful.
 
 Here are some cases where the fast CPRNG is used:
 
   1) Generation of ip_id values
   2) Generation of ephemeral port numbers
   3) Generation of explicit IVs for use in cryptographic network
  protocols
   4) Randomization of addresses for mappings of userspace processes
   5) Generation of the random canary used by the userspace stack
  smash protection code
 
 I think these cases have some things in common and we can begin to intuit
 some general rules from them.

That’s not clear to me, probably because I haven’t seen descriptions of the 
vulnerabilities.  I know there is some discussion of #3 that I can find; I 
don’t know about the others.

 
 It seems to me that we are generally looking at cases where:
 
   * The value of the resource directly being protected is ephemeral
 
   * An active attack is required in order to exploit any weakness
 in the RNG
 
   * The constant factor for an adversary to attack the RNG
 is quite large (for example, making a process crash across the
 network)
 
   * Nonetheless attacks on non-cryptographic RNGs have been empirically
 sucessful in all of these cases.
 
 I don't think anyone knows much about how to design a RNG that's just strong
 enough even if we had a much more detailed threat model (though one might
 argue that one could keep subtracting rounds from ChaCha and arrive at ChaCha4
 or so forth; and there is the intriguing suggestion to be found various places
 that, in fact, the NSA does do things very much like this, with Skipjack
 having been designed to be just as strong as it needed to be and no stronger).

Note that I was not advocating “just strong enough”.  I was advocating “strong 
enough”, which requires knowing what the minimum is so we can be comfortably 
beyond that.
 
 But we do know how to select an RNG that meets some minimum strength criteria
 and is faster than the fast RNG we already have.  And that is what we are
 doing in this discussion.

Yes, the discussion is about an RNG that is weaker than the existing strong 
RNG.  How much weaker is not clear.  But the critical point is that I can’t 
tell whether it is stronger than the minimum required, or weaker than that.  

paul



Re: cprng_fast implementation benchmarks

2014-04-25 Thread Thor Lancelot Simon
On Fri, Apr 25, 2014 at 01:53:13PM +, paul_kon...@dell.com wrote:
 
 Yes, the discussion is about an RNG that is weaker than the existing
 strong RNG.  How much weaker is not clear.

There's not a single answer, because the CTR_DRBG is designed to resist
attacks that really don't seem relevant here.  It's not a simple
question of whether one cipher is stronger than other.

Even if you compare the core transforms (AES-128 in the case of the
CTR_DRBG vs ChaCha8) it's not at all clear that ChaCha8 is any weaker.

There is not any currently known attack on 8 rounds of ChaCha that is
better than brute force on its 256-bit key.  AES-128 is, at best, 128
bits strong.

 that I can?t tell whether it is stronger than the minimum required, or
 weaker than that.

At present, if we are talking simply about the strength of the cipher
itself (rather than about properties such as backtracking resistance)
there's no attack better than brute-forcing the 256 bit key.  It seems
to me that is probably good enough.

Thor


Re: cprng_fast implementation benchmarks

2014-04-25 Thread Paul_Koning

On Apr 25, 2014, at 4:31 PM, Thor Lancelot Simon t...@panix.com wrote:

 On Fri, Apr 25, 2014 at 01:53:13PM +, paul_kon...@dell.com wrote:
 
 Yes, the discussion is about an RNG that is weaker than the existing
 strong RNG.  How much weaker is not clear.
 
 There's not a single answer, because the CTR_DRBG is designed to resist
 attacks that really don't seem relevant here.  It's not a simple
 question of whether one cipher is stronger than other.

I agree.  I don’t think I said that.  And I can easily imagine that some 
consumers of random numbers are not concerned about attack X that another one 
does care about and that the strong generator is designed to resist.

But what X are we talking about?  Security analysis does not come from 
generalities, it comes from the point by point analysis of specific questions.  
You correctly point out that some attacks may not be relevant.  Which attacks 
are not relevant, and which ones are, and why?

 
 Even if you compare the core transforms (AES-128 in the case of the
 CTR_DRBG vs ChaCha8) it's not at all clear that ChaCha8 is any weaker.
 
 There is not any currently known attack on 8 rounds of ChaCha that is
 better than brute force on its 256-bit key.  AES-128 is, at best, 128
 bits strong.

True, but of course there is AES-256 if you need that.

The real difference is that the body of research against AES is vastly greater 
than that against ChaCha.

 
 that I can?t tell whether it is stronger than the minimum required, or
 weaker than that.
 
 At present, if we are talking simply about the strength of the cipher
 itself (rather than about properties such as backtracking resistance)
 there's no attack better than brute-forcing the 256 bit key.  It seems
 to me that is probably good enough.

I was specifically NOT talking about the strength of the underlying cipher as 
the core question.  The interesting questions all relate to the attacks against 
the RNG consumer, and what RNG properties each of those consumers needs to foil 
the interesting attacks.

For RNGs, backtrack resistance is an example of the kind of property that 
matters (or rather, might matter — it depends on what the RNG consumer needs).  
  Depending on the RNG construction, its strength against interesting attacks 
may be equal to the strength of the underlying cipher against key recovery 
attack, or it may be less.  

paul


Re: cprng_fast implementation benchmarks

2014-04-25 Thread Thor Lancelot Simon
On Fri, Apr 25, 2014 at 08:58:54PM +, paul_kon...@dell.com wrote:
 
 But what X are we talking about?  Security analysis does not come from
 generalities, it comes from the point by point analysis of specific
 questions.  You correctly point out that some attacks may not be
 relevant.  Which attacks are not relevant, and which ones are, and why?

I feel like you keep moving the goal posts.  I have a limited amount of
time to respond -- I wish I had more.  But this is going to be my last
message on this topic.

If you want a rigorous, numerical analysis of threat and risk, I can't give
you one.  I offered a number of examples and pointed out what seemed like
commonalities to me.  Perhaps they don't seem that way to you.

This, I can say: you started out by objecting (as far as I can tell) to what
you perceived as a weakening of the kernel strong cprng.  That is not the
case.  Then -- it seems to me -- you changed over to arguing that there was
no point to having both strong and fast cryptograpnic prngs in the kernel.

Then -- inexplicably to me -- you seemed to change tack to arguing that
the proposed fast cprng might not be strong enough.  I suppose you are
entitled to change your mind.  God knows I do often enough.

Here is the bottom line for me; perhaps I did not get it across clearly
enough last time:

* There are certain cases in our kernel for which:

* We cannot use the strong CPRNG without causing
  undue performance impact, but

* Where in these exact or closely analogous cases
  real security issues have resulted from the use
  of non-cryptographic PRNGs such as LCGs.

* Fortunately for many of these cases the cost per try for
  an attack on the PRNG is high, which suggests -- though we
  cannot give a precise numerical estimate -- that a PRNG
  considerably weaker than the strong CPRNG is acceptable.

* Even better, in many of these cases the PRNG need resist
  attack only for a limited lifetime such as that of a
  single network connection or a single process.

* For these cases we would like something that we can only
  rather vaguely describe as somewhat stronger than the
  non-cryptographic generators but much faster than the
  strong CPRNG.

* The rough consensus of the developers is that ChaCha8, which
  is much faster than the CTR_DRBG *or* the previous cprng_fast
  implementation, and much stronger than old implementation,
  is a very good match for these -- admittedly informal --
  requirements.

As an aside, a stream cipher from the Salsa/ChaCha family was the
recommendation of two different cryptographers independently consulted
by Taylor and myself for this application.  I am not comfortable dialing
back the number of rounds further to get more speed, nor do I think we
need more rounds to get more security.  But since this cipher is never
used to key anything that persists, we can adjust this in the future if
at that time, it seems wise.

Thor


Re: cprng_fast implementation benchmarks

2014-04-24 Thread Mindaugas Rasiukevicius
paul_kon...@dell.com wrote:
  No.  There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with
  AES128 as the block transform.
  
  There is also a fast RNG, based on RC4.
  
  We are discussing the replacement of the fast RNG.
 
 Ok. But if that’s a non-strong RNG, why are we discussing security
 properties?  And why are we considering algorithms this complex, rather
 than using a PRNG?
 
 In other words, this is being treated like it’s in between a PRNG and a
 strong RNG.  I don’t understand why there can be a middle ground like
 that, and what its required properties would be.

There are cases when both security and performance matters.  Consider
TCP ISN generation or UDP port number generation (i.e. randomisation).
There are known security issues if these numbers can be predicted, but
at the same time, high performance penalty is undesirable in the network
stack.  However, the requirements are a bit different: the life time of
a packet or connection tends to be much smaller than of some encrypted
and permanently stored piece of information.  Arguably, given a short
life time, a weaker (but faster) CPRNG is enough for making potential
attacks unpractical.  Do you disagree?

Finally, if we make a weaker CPRNG really fast, why bother with a third
implementation of PRNG?

-- 
Mindaugas



Re: cprng_fast implementation benchmarks

2014-04-24 Thread Paul_Koning

On Apr 24, 2014, at 12:40 PM, Mindaugas Rasiukevicius rm...@netbsd.org wrote:

 paul_kon...@dell.com wrote:
 No.  There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with
 AES128 as the block transform.
 
 There is also a fast RNG, based on RC4.
 
 We are discussing the replacement of the fast RNG.
 
 Ok. But if that’s a non-strong RNG, why are we discussing security
 properties?  And why are we considering algorithms this complex, rather
 than using a PRNG?
 
 In other words, this is being treated like it’s in between a PRNG and a
 strong RNG.  I don’t understand why there can be a middle ground like
 that, and what its required properties would be.
 
 There are cases when both security and performance matters.  Consider
 TCP ISN generation or UDP port number generation (i.e. randomisation).
 There are known security issues if these numbers can be predicted, but
 at the same time, high performance penalty is undesirable in the network
 stack.  However, the requirements are a bit different: the life time of
 a packet or connection tends to be much smaller than of some encrypted
 and permanently stored piece of information.  Arguably, given a short
 life time, a weaker (but faster) CPRNG is enough for making potential
 attacks unpractical.  Do you disagree?

I think I do.  The description you gave seems to amount to: we need something 
that is better than a PRNG but it doesn’t have to be as strong as the real 
crypto RNG we have.  But that’s not a particularly precise definition, and it’s 
hard to judge whether a proposed implementation meets the requirements, or not. 
 

In general, where security issues are involved, it is desirable to have 
properties expressed quantitatively.  For example, security equivalent to brute 
force search over a 2^128 (128 bit) key space.  Or brute force over  some other 
2^n (n bit) key space.

Knowing that there are “security issues” with UDP port number generation may 
mean that a PRNG is inadequate.  Deciding what sort of generator IS adequate, 
though, means starting with a more definite description of the nature of the 
attacks that we’re worried about, and the strength of the defense that is 
desired.

 
 Finally, if we make a weaker CPRNG really fast, why bother with a third
 implementation of PRNG?

Only if the “weaker CPRNG” is good enough.  Right now I don’t see any way to 
decide that.

Now, if you were to propose using the AES-based strong RNG for the functions in 
question, we’re probably done as far as security is concerned; then the only 
remaining question is whether that function can be optimized well enough that 
we’re comfortable with its performance.

paul



Re: cprng_fast implementation benchmarks

2014-04-24 Thread Mindaugas Rasiukevicius
paul_kon...@dell.com wrote:
  There are cases when both security and performance matters.  Consider
  TCP ISN generation or UDP port number generation (i.e. randomisation).
  There are known security issues if these numbers can be predicted, but
  at the same time, high performance penalty is undesirable in the network
  stack.  However, the requirements are a bit different: the life time of
  a packet or connection tends to be much smaller than of some encrypted
  and permanently stored piece of information.  Arguably, given a short
  life time, a weaker (but faster) CPRNG is enough for making potential
  attacks unpractical.  Do you disagree?
 
 I think I do.  The description you gave seems to amount to: we need
 something that is better than a PRNG but it doesn’t have to be as strong
 as the real crypto RNG we have.  But that’s not a particularly precise
 definition, and it’s hard to judge whether a proposed implementation
 meets the requirements, or not.  
 
 In general, where security issues are involved, it is desirable to have
 properties expressed quantitatively.  For example, security equivalent to
 brute force search over a 2^128 (128 bit) key space.  Or brute force
 over  some other 2^n (n bit) key space.
 
 Knowing that there are “security issues” with UDP port number generation
 may mean that a PRNG is inadequate.  Deciding what sort of generator IS
 adequate, though, means starting with a more definite description of the
 nature of the attacks that we’re worried about, and the strength of the
 defense that is desired.

But you do not disagree with the concept of having weak and strong CPRNG,
do you?  I think what you are basically saying is that we should take more
academic approach in a way we classify weak and strong.  Yes, I agree
with that.  Thor made a brief overview in his Towards design criteria for
cprng_fast() email which is somewhat a step to that direction, but doing
it properly requires a study.  That requires human resources which we may
or may not have.  Do you know potential volunteers?

Meanwhile, Thor's work is a step forwards from what we have in the tree,
regardless whether weak/strong definition improves or not.

-- 
Mindaugas



Re: cprng_fast implementation benchmarks

2014-04-24 Thread Paul_Koning

On Apr 24, 2014, at 3:03 PM, Mindaugas Rasiukevicius rm...@netbsd.org wrote:

 paul_kon...@dell.com wrote:
 ...
 Knowing that there are “security issues” with UDP port number generation
 may mean that a PRNG is inadequate.  Deciding what sort of generator IS
 adequate, though, means starting with a more definite description of the
 nature of the attacks that we’re worried about, and the strength of the
 defense that is desired.
 
 But you do not disagree with the concept of having weak and strong CPRNG,
 do you?  

I do disagree.  The reason is that I see no requirements that make it possible 
to decide whether the weak generator is useful.

If it useful only if there are random number consumers that have requirements 
that a simple PRNG can’t satisfy, and the workload is high enough that the 
achievable performance of the strong RNG is a concern, and there exists an RNG 
algorithm that meets both the performance needs and the security needs of those 
consumers.

There’s a lot of discussion about performance.  And some general statements 
about security.  But I don’t see the data that allows anyone to decide the 
question I stated.

In the absence of a “yes” answer, indeed I do disagree with the concept.

paul



Re: cprng_fast implementation benchmarks

2014-04-24 Thread Thor Lancelot Simon
On Thu, Apr 24, 2014 at 07:23:41PM +, paul_kon...@dell.com wrote:
 
 I do disagree.  The reason is that I see no requirements that make it
 possible to decide whether the weak generator is useful.

Here are some cases where the fast CPRNG is used:

1) Generation of ip_id values
2) Generation of ephemeral port numbers
3) Generation of explicit IVs for use in cryptographic network
   protocols
4) Randomization of addresses for mappings of userspace processes
5) Generation of the random canary used by the userspace stack
   smash protection code

I think these cases have some things in common and we can begin to intuit
some general rules from them.

It seems to me that we are generally looking at cases where:

* The value of the resource directly being protected is ephemeral

* An active attack is required in order to exploit any weakness
  in the RNG

* The constant factor for an adversary to attack the RNG
  is quite large (for example, making a process crash across the
  network)

* Nonetheless attacks on non-cryptographic RNGs have been empirically
  sucessful in all of these cases.

I don't think anyone knows much about how to design a RNG that's just strong
enough even if we had a much more detailed threat model (though one might
argue that one could keep subtracting rounds from ChaCha and arrive at ChaCha4
or so forth; and there is the intriguing suggestion to be found various places
that, in fact, the NSA does do things very much like this, with Skipjack
having been designed to be just as strong as it needed to be and no stronger).

But we do know how to select an RNG that meets some minimum strength criteria
and is faster than the fast RNG we already have.  And that is what we are
doing in this discussion.

Thor


Re: cprng_fast implementation benchmarks

2014-04-23 Thread Joerg Sonnenberger
On Tue, Apr 22, 2014 at 11:59:38PM -0400, Thor Lancelot Simon wrote:
 I believe ChaCha8 is suitable for our purpose: we were previously considering
 ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit
 strength against the best currently known attack (at present, there is no
 attack better than brute force on ChaCha8, and the best attack on ChaCha7
 is 2^248).  ChaCha8 appears to be somewhat faster than the old arc4 
 implementation.

Sounds wrong. When I measured Salsa20/8, it was ~3 times faster than
RC4. Code can be found at
http://www.netbsd.org/~joerg/arc4random_salsa.c.

Joerg


Re: cprng_fast implementation benchmarks

2014-04-23 Thread Mindaugas Rasiukevicius
Thor Lancelot Simon t...@panix.com wrote:
RESULTS
 
 kernelcpb (32 bit)4GB (1 way) 16GB (4 ways)   Scaling Factor
 ----- -   --
 arc4-mtx  35  42.58   398.83  0.106
 arc4-nomtx24  42.12   2338.92 0.018
 arc4-percpu   27  33.63   41.59   0.808
 hc128-percpu  21  23.75   34.90   0.680
 hc128-inline  19  22.66   31.75   0.713
 chacha8   22  20.51   30.45   0.662
 chacha12  24  24.87   34.32   0.724
 chacha20  28  30.45   39.28   0.775

Looks good!

You mentioned that 4GB of data are generated by requesting 256 bytes.
It would be more interesting to see the throughput of 4 byte requests
i.e. how many cprng_fast32() calls per second can we do?

My vote would go to a cipher of Salsa20 family.  They undergone quite
more cryptoanalysis than HC-128.

-- 
Mindaugas


Re: cprng_fast implementation benchmarks

2014-04-23 Thread Mindaugas Rasiukevicius
Manuel Bouyer bou...@antioche.eu.org wrote:
 On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote:
  [...]
  Do we still have a compile-time way to check if the kernel (or port) is
  uniprocessor only?  If so we should probably #ifdef away the percpu
  calls in such kernels, which are probably for slower hardware anyway.
 
 AFAIK options MULTIPROCESSOR is still here
 

It is, but it should generally be avoided, unless the optimisation is
really significant.  Also, make it least invasive, e.g. instead of #ifdef
in a function body, do something like:

#ifdef MULTIPROCESSOR
#define cprng_getctx()  percpu_getref(percpu_cprng_fast_ctx)
#define cprng_putctx()  percpu_putref(percpu_cprng_fast_ctx)
#else
static cprng_fast_ctx_t cprng_global_ctx;
#define cprng_getctx()  (cprng_global_ctx)
#define cprng_putctx()
#endif

Not sure if it is worth, though.

-- 
Mindaugas


Re: cprng_fast implementation benchmarks

2014-04-23 Thread Paul_Koning

On Apr 22, 2014, at 11:59 PM, Thor Lancelot Simon t...@panix.com wrote:

 ...
 RESULTS
 
 kernelcpb (32 bit)4GB (1 way) 16GB (4 ways)   Scaling 
 Factor
 ----- -   
 --
 arc4-mtx  35  42.58   398.83  0.106
 arc4-nomtx24  42.12   2338.92 0.018
 arc4-percpu   27  33.63   41.59   0.808
 hc128-percpu  21  23.75   34.90   0.680
 hc128-inline  19  22.66   31.75   0.713
 chacha8   22  20.51   30.45   0.662
 chacha12  24  24.87   34.32   0.724
 chacha20  28  30.45   39.28   0.775
 
 I believe ChaCha8 is suitable for our purpose: we were previously considering
 ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit
 strength against the best currently known attack (at present, there is no
 attack better than brute force on ChaCha8, and the best attack on ChaCha7
 is 2^248).  ChaCha8 appears to be somewhat faster than the old arc4 
 implementation.

I’ve been watching this long stream of messages flying by, and I’m a bit 
concerned about the approach.

As I understand it, there is a strong RNG, based on RC4 (“ARC4”) in the kernel 
today.  It is used by some things that require strong random numbers, and also 
by things that don’t (or at most, have weaker requirements than the 
cryptographic operations that require serious strength).  It isn’t clear to me 
what fraction of the workload really requires the cryptographically strong 
generator.

Replacing the existing strong generator by a new one that is faster is 
tempting.  The question is: how much confidence do you need in the new 
algorithm?

For things like port randomization, not much.  One might argue that a PRNG is 
fine for that.  On the other hand, for those spots where cryptographic random 
numbers are required, a lot.

It isn’t at all clear to me that the proposed replacements are sufficiently 
well analyzed.  If someone were to propose a replacement whose security is 
demonstrably that of the AES block cipher, I’d be a whole lot more comfortable, 
because AES is one of the very few cryptosystems out there that has had 
significant analysis.  (RC4 may be another, though in that case there have been 
some results that raise concern.)

It would probably also be useful to identify the various uses of kernel RNGs, 
and document clearly what their security requirements (if any) are.  

paul



Re: cprng_fast implementation benchmarks

2014-04-23 Thread Joerg Sonnenberger
On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote:
 On Wed, Apr 23, 2014 at 10:57:59AM +0200, Joerg Sonnenberger wrote:
  On Tue, Apr 22, 2014 at 11:59:38PM -0400, Thor Lancelot Simon wrote:
   I believe ChaCha8 is suitable for our purpose: we were previously 
   considering
   ciphers with, at most, 128-bit security, and even 6-round ChaCha has 
   139-bit
   strength against the best currently known attack (at present, there is no
   attack better than brute force on ChaCha8, and the best attack on ChaCha7
   is 2^248).  ChaCha8 appears to be somewhat faster than the old arc4 
   implementation.
  
  Sounds wrong. When I measured Salsa20/8, it was ~3 times faster than
  RC4. Code can be found at
  http://www.netbsd.org/~joerg/arc4random_salsa.c.
 
 That's a libc implementation -- and were you calling it for 32 bits at a
 time, or bulk data?

I measured either case, extracting 32bits at a time vs doing larger
operations.

Joerg


Re: cprng_fast implementation benchmarks

2014-04-23 Thread David Laight
On Wed, Apr 23, 2014 at 03:30:09PM +0200, Manuel Bouyer wrote:
 On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote:
  [...]
  Do we still have a compile-time way to check if the kernel (or port) is
  uniprocessor only?  If so we should probably #ifdef away the percpu calls
  in such kernels, which are probably for slower hardware anyway.
 
 AFAIK options MULTIPROCESSOR is still here

Do the percpu() calls collapse out for non-MULTIPROCESSOR kernels?

In any case you'd want to do what is done with some of the mutex code.
ie overwrite the code of the SMP version with that of the
uniprocessor one if the current system only has one cpu.

David

-- 
David Laight: da...@l8s.co.uk


Re: cprng_fast implementation benchmarks

2014-04-23 Thread Thor Lancelot Simon
On Wed, Apr 23, 2014 at 02:21:31PM +, paul_kon...@dell.com wrote:
 
 I?ve been watching this long stream of messages flying by, and I?m a bit 
 concerned about the approach.
 
 As I understand it, there is a strong RNG, based on RC4 (?ARC4?) in the 
 kernel today.

No.  There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with AES128
as the block transform.

There is also a fast RNG, based on RC4.

We are discussing the replacement of the fast RNG.

Thor


Re: cprng_fast implementation benchmarks

2014-04-22 Thread Taylor R Campbell
   Date: Sun, 20 Apr 2014 03:18:03 -0400
   From: Thor Lancelot Simon t...@panix.com

   The following tests were run:

   1) Cycles-per-byte, 32 bit request: at initial keying of
  the generator and each automatic rekeying.

   2) Generate 4GB of data by requesting 256 bytes at a time
  from userspace using a small C program that loops around
  the kern.arandom sysctl call.  Values are in seconds.

   3) Generate 16GB of data by running 4 copies of the generator
  program in parallel.  Values are in seconds.

It's good to confirm by these tests that going per-CPU didn't hurt
single-threaded bulk throughput and improved scalability -- so much
that just about any halfway reasonable choice of stream cipher would
do better than what we have now or had before.

What these tests don't measure is the impact on overall system
performance, especially via the CPU cache.  I don't expect going
per-CPU to hurt cachewise, but the difference between RC4's 256-byte
state and HC-128's 4 KB state may be noticeable in, say, build.sh, or
perhaps high network load triggering cprng_fasts in ALTQ or IPsec or
something.

   Let's test SALSA20 and discuss further.

I wrote a little code to implement the cprng_fast API using Salsa20 or
ChaCha with buffer for a single cipher core output.  The code, along
with a little driver to measure single-threaded speed and test vectors
to ensure we are computing Salsa20 and ChaCha, is temporarily at

https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/.

The entire state fits in two 64-byte cache lines -- half what RC4
took.  On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk,
average of ~50 cycles for each cprng_fast32, and maximum of ~300
cycles.

(If you need help reproducing my results, let me know.  This is all a
userland mockup, of course, but adapting it to the real kernel
cprng_fast shouldn't be hard.)

This implementation also has the property that the maximum work it
does with interrupts blocked is a single cipher core and copying at
most 64 bytes, plus some additions/subtractions to count things and
maybe schedule a softint.

It services long requests by grabbing a temporary key from the buffer
while interrupts are blocked, and then generating the rest from the
temporary key, which works well because Salsa20 and ChaCha have zero
key setup cost.

   We should run some of these benchmarks on big-endian systems and systems
   that are architecturally unlike the OOO x86-64 I used for my tests.

If I find time I can do some tests on some littleish (by modern
standards) arm and powerpc systems, but I'll have to defer to others
for the VAXen and toasters.


Re: cprng_fast implementation benchmarks

2014-04-21 Thread Dennis Ferguson

On 20 Apr, 2014, at 00:18 , Thor Lancelot Simon t...@panix.com wrote:
 Personally, I'd go with the HC128 variant without the ugly inlining.  We
 restrict bytes-on-key to 512MB -- the best current attack on HC128 requires
 2^65 bytes to distinguish its keystream, so I think it's a reasonable choice
 at this time.  Percpu ARC4 is really a strawman; RC4 is just not safe at any
 speed any longer.  Let's test SALSA20 and discuss further.

The thing that bothers me about HC-128 is that its output is apparently not
very random.  This paper

http://viper.eng.iastate.edu/gmdev/pubs/tantra.pdf

shows the number of failures it gets when run through the TestU01 batteries.
While one would need to see the p-values of the failures to know if these are
kind of bad or really bad, the fact that even the Small Crush battery
is finding stuff (that battery takes under 10 seconds to run) is making
this seem, well, not good.  I personally might overlook this if there
were some offsetting advantages to HC-128, but the fact that the mediocre
random numbers it produces are being paid for with 4500 bytes of context is
making me think it can't be a good idea.

When Salsa20/ChaCha was mentioned I turned ChaCha into a PRNG library, available
here

http://www.netbsd.org/~dennis/ccrand.tgz

if it is useful, and ran the three variations through TestU01 batteries.  
ChaCha8
produced two failures, one in Crush (about half an hour of CPU time to run) and
one in Big Crush (about 3 hours).  ChaCha12 produced a single failure, in Crush.
I fumble-fingered running ChaCha20 and had to redo that, but it has now run
everything except the last half hour of Big Crush with no failures so far.  For
comparison, arc4random() gets a couple of failures in Big Crush, the Mersenne
Twister (which had a good reputation as a fast, high quality non-cryptographic
PRNG) also gets a couple failures and ISAAC, my prior favorite PRNG, runs the
entire set of batteries clean even though it uses less than half the state space
of HC-128 (AES and SHA-1 also run clean, but more slowly).  Given that ChaCha
seems to be keeping pretty good company in terms of random number quality, and
manages to do this securely as well with only 128 bytes of state (i.e. 40 ChaCha
instances consume the same state space as one HC-128 instance), I am thinking
that this seems like a pretty good idea even if it turns out that ChaCha20 isn't
as easy on the CPU as one would prefer.

The reason I think minimizing the amount of state an instance of the PRNG 
requires
is important derives from something clearly demonstrated by your measurements,
that often the cost of computing something from shared data can be hugely 
dominated
not by the computation but rather by the sharing, in terms the cost of taking
locks or in having that data frequently invalidated from your cache or whatever
else you do to ensure mutual exclusion.  I think the approach you are taking,
of maintaining more instances of the data which are shared much less widely,
is exactly right, but I don't think per-CPU plus splhigh() is a recipe that 
works
long term.  It is correct to point out that the random number generator won't be
doing much at splhigh(), but what is incorrect is to assume that the random 
number
generator is the only thing that will need a solution to this problem.  If
per-CPU+splhigh() is the solution to all such problems then aggregate may end up
running at splhigh() a lot, even if each individual application doesn't do much,
and I don't think that is a good outcome.

So I think it might be useful if someone eventually figured out how to
eliminate the splhigh() in favor of even more instances shared even
less widely (but more compatibly), say per-CPU, per-interrupt-level or
even per-application.  If it turns out that having a large number of
random number instances is the best arrangement then the cost of having
a large number of instances becomes relevant.  ChaCha's 128 bytes of state
per instance makes it highly desirable, while HC-128's 4500 bytes per instance
seems a lot to pay for not very random numbers even if it computes them
quickly in benchmarks.

Dennis Ferguson