Re: cprng_fast implementation benchmarks
After reading all these messages a bit too fast, it seems to be that we can add to Thor's analysis Moving the fast-not-super-strong PRNG to ChaCha8 is clearly a step forward from what we have now. (really this is his conclusion) It remains for Someone to do more formal work (perhaps in an academic context) to give specifications for good-enough-random, to analayze if our implementation meets the specification, and if our uses can reasonably rely on them. (I think the lack of this is the essence of what Paul is pointing out, and it's a fair point). I'll ask around to see if I can find a spare intern. (That's sort of a joke but not 100%; this does seem like useful work to do.) pgp9yfd0JhRlX.pgp Description: PGP signature
Re: cprng_fast implementation benchmarks
On Apr 25, 2014, at 10:09 PM, Thor Lancelot Simon t...@panix.com wrote: On Fri, Apr 25, 2014 at 08:58:54PM +, paul_kon...@dell.com wrote: But what X are we talking about? Security analysis does not come from generalities, it comes from the point by point analysis of specific questions. You correctly point out that some attacks may not be relevant. Which attacks are not relevant, and which ones are, and why? I feel like you keep moving the goal posts. That was not my intent. But I can see why it seems that way, and I apologize for that. What’s going on here is that I’m learning about what is being proposed as I work through a sequence of messages. paul
Re: cprng_fast implementation benchmarks
On Apr 24, 2014, at 10:09 PM, Thor Lancelot Simon t...@panix.com wrote: On Thu, Apr 24, 2014 at 07:23:41PM +, paul_kon...@dell.com wrote: I do disagree. The reason is that I see no requirements that make it possible to decide whether the weak generator is useful. Here are some cases where the fast CPRNG is used: 1) Generation of ip_id values 2) Generation of ephemeral port numbers 3) Generation of explicit IVs for use in cryptographic network protocols 4) Randomization of addresses for mappings of userspace processes 5) Generation of the random canary used by the userspace stack smash protection code I think these cases have some things in common and we can begin to intuit some general rules from them. That’s not clear to me, probably because I haven’t seen descriptions of the vulnerabilities. I know there is some discussion of #3 that I can find; I don’t know about the others. It seems to me that we are generally looking at cases where: * The value of the resource directly being protected is ephemeral * An active attack is required in order to exploit any weakness in the RNG * The constant factor for an adversary to attack the RNG is quite large (for example, making a process crash across the network) * Nonetheless attacks on non-cryptographic RNGs have been empirically sucessful in all of these cases. I don't think anyone knows much about how to design a RNG that's just strong enough even if we had a much more detailed threat model (though one might argue that one could keep subtracting rounds from ChaCha and arrive at ChaCha4 or so forth; and there is the intriguing suggestion to be found various places that, in fact, the NSA does do things very much like this, with Skipjack having been designed to be just as strong as it needed to be and no stronger). Note that I was not advocating “just strong enough”. I was advocating “strong enough”, which requires knowing what the minimum is so we can be comfortably beyond that. But we do know how to select an RNG that meets some minimum strength criteria and is faster than the fast RNG we already have. And that is what we are doing in this discussion. Yes, the discussion is about an RNG that is weaker than the existing strong RNG. How much weaker is not clear. But the critical point is that I can’t tell whether it is stronger than the minimum required, or weaker than that. paul
Re: cprng_fast implementation benchmarks
On Fri, Apr 25, 2014 at 01:53:13PM +, paul_kon...@dell.com wrote: Yes, the discussion is about an RNG that is weaker than the existing strong RNG. How much weaker is not clear. There's not a single answer, because the CTR_DRBG is designed to resist attacks that really don't seem relevant here. It's not a simple question of whether one cipher is stronger than other. Even if you compare the core transforms (AES-128 in the case of the CTR_DRBG vs ChaCha8) it's not at all clear that ChaCha8 is any weaker. There is not any currently known attack on 8 rounds of ChaCha that is better than brute force on its 256-bit key. AES-128 is, at best, 128 bits strong. that I can?t tell whether it is stronger than the minimum required, or weaker than that. At present, if we are talking simply about the strength of the cipher itself (rather than about properties such as backtracking resistance) there's no attack better than brute-forcing the 256 bit key. It seems to me that is probably good enough. Thor
Re: cprng_fast implementation benchmarks
On Apr 25, 2014, at 4:31 PM, Thor Lancelot Simon t...@panix.com wrote: On Fri, Apr 25, 2014 at 01:53:13PM +, paul_kon...@dell.com wrote: Yes, the discussion is about an RNG that is weaker than the existing strong RNG. How much weaker is not clear. There's not a single answer, because the CTR_DRBG is designed to resist attacks that really don't seem relevant here. It's not a simple question of whether one cipher is stronger than other. I agree. I don’t think I said that. And I can easily imagine that some consumers of random numbers are not concerned about attack X that another one does care about and that the strong generator is designed to resist. But what X are we talking about? Security analysis does not come from generalities, it comes from the point by point analysis of specific questions. You correctly point out that some attacks may not be relevant. Which attacks are not relevant, and which ones are, and why? Even if you compare the core transforms (AES-128 in the case of the CTR_DRBG vs ChaCha8) it's not at all clear that ChaCha8 is any weaker. There is not any currently known attack on 8 rounds of ChaCha that is better than brute force on its 256-bit key. AES-128 is, at best, 128 bits strong. True, but of course there is AES-256 if you need that. The real difference is that the body of research against AES is vastly greater than that against ChaCha. that I can?t tell whether it is stronger than the minimum required, or weaker than that. At present, if we are talking simply about the strength of the cipher itself (rather than about properties such as backtracking resistance) there's no attack better than brute-forcing the 256 bit key. It seems to me that is probably good enough. I was specifically NOT talking about the strength of the underlying cipher as the core question. The interesting questions all relate to the attacks against the RNG consumer, and what RNG properties each of those consumers needs to foil the interesting attacks. For RNGs, backtrack resistance is an example of the kind of property that matters (or rather, might matter — it depends on what the RNG consumer needs). Depending on the RNG construction, its strength against interesting attacks may be equal to the strength of the underlying cipher against key recovery attack, or it may be less. paul
Re: cprng_fast implementation benchmarks
On Fri, Apr 25, 2014 at 08:58:54PM +, paul_kon...@dell.com wrote: But what X are we talking about? Security analysis does not come from generalities, it comes from the point by point analysis of specific questions. You correctly point out that some attacks may not be relevant. Which attacks are not relevant, and which ones are, and why? I feel like you keep moving the goal posts. I have a limited amount of time to respond -- I wish I had more. But this is going to be my last message on this topic. If you want a rigorous, numerical analysis of threat and risk, I can't give you one. I offered a number of examples and pointed out what seemed like commonalities to me. Perhaps they don't seem that way to you. This, I can say: you started out by objecting (as far as I can tell) to what you perceived as a weakening of the kernel strong cprng. That is not the case. Then -- it seems to me -- you changed over to arguing that there was no point to having both strong and fast cryptograpnic prngs in the kernel. Then -- inexplicably to me -- you seemed to change tack to arguing that the proposed fast cprng might not be strong enough. I suppose you are entitled to change your mind. God knows I do often enough. Here is the bottom line for me; perhaps I did not get it across clearly enough last time: * There are certain cases in our kernel for which: * We cannot use the strong CPRNG without causing undue performance impact, but * Where in these exact or closely analogous cases real security issues have resulted from the use of non-cryptographic PRNGs such as LCGs. * Fortunately for many of these cases the cost per try for an attack on the PRNG is high, which suggests -- though we cannot give a precise numerical estimate -- that a PRNG considerably weaker than the strong CPRNG is acceptable. * Even better, in many of these cases the PRNG need resist attack only for a limited lifetime such as that of a single network connection or a single process. * For these cases we would like something that we can only rather vaguely describe as somewhat stronger than the non-cryptographic generators but much faster than the strong CPRNG. * The rough consensus of the developers is that ChaCha8, which is much faster than the CTR_DRBG *or* the previous cprng_fast implementation, and much stronger than old implementation, is a very good match for these -- admittedly informal -- requirements. As an aside, a stream cipher from the Salsa/ChaCha family was the recommendation of two different cryptographers independently consulted by Taylor and myself for this application. I am not comfortable dialing back the number of rounds further to get more speed, nor do I think we need more rounds to get more security. But since this cipher is never used to key anything that persists, we can adjust this in the future if at that time, it seems wise. Thor
Re: cprng_fast implementation benchmarks
paul_kon...@dell.com wrote: No. There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with AES128 as the block transform. There is also a fast RNG, based on RC4. We are discussing the replacement of the fast RNG. Ok. But if that’s a non-strong RNG, why are we discussing security properties? And why are we considering algorithms this complex, rather than using a PRNG? In other words, this is being treated like it’s in between a PRNG and a strong RNG. I don’t understand why there can be a middle ground like that, and what its required properties would be. There are cases when both security and performance matters. Consider TCP ISN generation or UDP port number generation (i.e. randomisation). There are known security issues if these numbers can be predicted, but at the same time, high performance penalty is undesirable in the network stack. However, the requirements are a bit different: the life time of a packet or connection tends to be much smaller than of some encrypted and permanently stored piece of information. Arguably, given a short life time, a weaker (but faster) CPRNG is enough for making potential attacks unpractical. Do you disagree? Finally, if we make a weaker CPRNG really fast, why bother with a third implementation of PRNG? -- Mindaugas
Re: cprng_fast implementation benchmarks
On Apr 24, 2014, at 12:40 PM, Mindaugas Rasiukevicius rm...@netbsd.org wrote: paul_kon...@dell.com wrote: No. There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with AES128 as the block transform. There is also a fast RNG, based on RC4. We are discussing the replacement of the fast RNG. Ok. But if that’s a non-strong RNG, why are we discussing security properties? And why are we considering algorithms this complex, rather than using a PRNG? In other words, this is being treated like it’s in between a PRNG and a strong RNG. I don’t understand why there can be a middle ground like that, and what its required properties would be. There are cases when both security and performance matters. Consider TCP ISN generation or UDP port number generation (i.e. randomisation). There are known security issues if these numbers can be predicted, but at the same time, high performance penalty is undesirable in the network stack. However, the requirements are a bit different: the life time of a packet or connection tends to be much smaller than of some encrypted and permanently stored piece of information. Arguably, given a short life time, a weaker (but faster) CPRNG is enough for making potential attacks unpractical. Do you disagree? I think I do. The description you gave seems to amount to: we need something that is better than a PRNG but it doesn’t have to be as strong as the real crypto RNG we have. But that’s not a particularly precise definition, and it’s hard to judge whether a proposed implementation meets the requirements, or not. In general, where security issues are involved, it is desirable to have properties expressed quantitatively. For example, security equivalent to brute force search over a 2^128 (128 bit) key space. Or brute force over some other 2^n (n bit) key space. Knowing that there are “security issues” with UDP port number generation may mean that a PRNG is inadequate. Deciding what sort of generator IS adequate, though, means starting with a more definite description of the nature of the attacks that we’re worried about, and the strength of the defense that is desired. Finally, if we make a weaker CPRNG really fast, why bother with a third implementation of PRNG? Only if the “weaker CPRNG” is good enough. Right now I don’t see any way to decide that. Now, if you were to propose using the AES-based strong RNG for the functions in question, we’re probably done as far as security is concerned; then the only remaining question is whether that function can be optimized well enough that we’re comfortable with its performance. paul
Re: cprng_fast implementation benchmarks
paul_kon...@dell.com wrote: There are cases when both security and performance matters. Consider TCP ISN generation or UDP port number generation (i.e. randomisation). There are known security issues if these numbers can be predicted, but at the same time, high performance penalty is undesirable in the network stack. However, the requirements are a bit different: the life time of a packet or connection tends to be much smaller than of some encrypted and permanently stored piece of information. Arguably, given a short life time, a weaker (but faster) CPRNG is enough for making potential attacks unpractical. Do you disagree? I think I do. The description you gave seems to amount to: we need something that is better than a PRNG but it doesn’t have to be as strong as the real crypto RNG we have. But that’s not a particularly precise definition, and it’s hard to judge whether a proposed implementation meets the requirements, or not. In general, where security issues are involved, it is desirable to have properties expressed quantitatively. For example, security equivalent to brute force search over a 2^128 (128 bit) key space. Or brute force over some other 2^n (n bit) key space. Knowing that there are “security issues” with UDP port number generation may mean that a PRNG is inadequate. Deciding what sort of generator IS adequate, though, means starting with a more definite description of the nature of the attacks that we’re worried about, and the strength of the defense that is desired. But you do not disagree with the concept of having weak and strong CPRNG, do you? I think what you are basically saying is that we should take more academic approach in a way we classify weak and strong. Yes, I agree with that. Thor made a brief overview in his Towards design criteria for cprng_fast() email which is somewhat a step to that direction, but doing it properly requires a study. That requires human resources which we may or may not have. Do you know potential volunteers? Meanwhile, Thor's work is a step forwards from what we have in the tree, regardless whether weak/strong definition improves or not. -- Mindaugas
Re: cprng_fast implementation benchmarks
On Apr 24, 2014, at 3:03 PM, Mindaugas Rasiukevicius rm...@netbsd.org wrote: paul_kon...@dell.com wrote: ... Knowing that there are “security issues” with UDP port number generation may mean that a PRNG is inadequate. Deciding what sort of generator IS adequate, though, means starting with a more definite description of the nature of the attacks that we’re worried about, and the strength of the defense that is desired. But you do not disagree with the concept of having weak and strong CPRNG, do you? I do disagree. The reason is that I see no requirements that make it possible to decide whether the weak generator is useful. If it useful only if there are random number consumers that have requirements that a simple PRNG can’t satisfy, and the workload is high enough that the achievable performance of the strong RNG is a concern, and there exists an RNG algorithm that meets both the performance needs and the security needs of those consumers. There’s a lot of discussion about performance. And some general statements about security. But I don’t see the data that allows anyone to decide the question I stated. In the absence of a “yes” answer, indeed I do disagree with the concept. paul
Re: cprng_fast implementation benchmarks
On Thu, Apr 24, 2014 at 07:23:41PM +, paul_kon...@dell.com wrote: I do disagree. The reason is that I see no requirements that make it possible to decide whether the weak generator is useful. Here are some cases where the fast CPRNG is used: 1) Generation of ip_id values 2) Generation of ephemeral port numbers 3) Generation of explicit IVs for use in cryptographic network protocols 4) Randomization of addresses for mappings of userspace processes 5) Generation of the random canary used by the userspace stack smash protection code I think these cases have some things in common and we can begin to intuit some general rules from them. It seems to me that we are generally looking at cases where: * The value of the resource directly being protected is ephemeral * An active attack is required in order to exploit any weakness in the RNG * The constant factor for an adversary to attack the RNG is quite large (for example, making a process crash across the network) * Nonetheless attacks on non-cryptographic RNGs have been empirically sucessful in all of these cases. I don't think anyone knows much about how to design a RNG that's just strong enough even if we had a much more detailed threat model (though one might argue that one could keep subtracting rounds from ChaCha and arrive at ChaCha4 or so forth; and there is the intriguing suggestion to be found various places that, in fact, the NSA does do things very much like this, with Skipjack having been designed to be just as strong as it needed to be and no stronger). But we do know how to select an RNG that meets some minimum strength criteria and is faster than the fast RNG we already have. And that is what we are doing in this discussion. Thor
Re: cprng_fast implementation benchmarks
On Tue, Apr 22, 2014 at 11:59:38PM -0400, Thor Lancelot Simon wrote: I believe ChaCha8 is suitable for our purpose: we were previously considering ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit strength against the best currently known attack (at present, there is no attack better than brute force on ChaCha8, and the best attack on ChaCha7 is 2^248). ChaCha8 appears to be somewhat faster than the old arc4 implementation. Sounds wrong. When I measured Salsa20/8, it was ~3 times faster than RC4. Code can be found at http://www.netbsd.org/~joerg/arc4random_salsa.c. Joerg
Re: cprng_fast implementation benchmarks
Thor Lancelot Simon t...@panix.com wrote: RESULTS kernelcpb (32 bit)4GB (1 way) 16GB (4 ways) Scaling Factor ----- - -- arc4-mtx 35 42.58 398.83 0.106 arc4-nomtx24 42.12 2338.92 0.018 arc4-percpu 27 33.63 41.59 0.808 hc128-percpu 21 23.75 34.90 0.680 hc128-inline 19 22.66 31.75 0.713 chacha8 22 20.51 30.45 0.662 chacha12 24 24.87 34.32 0.724 chacha20 28 30.45 39.28 0.775 Looks good! You mentioned that 4GB of data are generated by requesting 256 bytes. It would be more interesting to see the throughput of 4 byte requests i.e. how many cprng_fast32() calls per second can we do? My vote would go to a cipher of Salsa20 family. They undergone quite more cryptoanalysis than HC-128. -- Mindaugas
Re: cprng_fast implementation benchmarks
Manuel Bouyer bou...@antioche.eu.org wrote: On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote: [...] Do we still have a compile-time way to check if the kernel (or port) is uniprocessor only? If so we should probably #ifdef away the percpu calls in such kernels, which are probably for slower hardware anyway. AFAIK options MULTIPROCESSOR is still here It is, but it should generally be avoided, unless the optimisation is really significant. Also, make it least invasive, e.g. instead of #ifdef in a function body, do something like: #ifdef MULTIPROCESSOR #define cprng_getctx() percpu_getref(percpu_cprng_fast_ctx) #define cprng_putctx() percpu_putref(percpu_cprng_fast_ctx) #else static cprng_fast_ctx_t cprng_global_ctx; #define cprng_getctx() (cprng_global_ctx) #define cprng_putctx() #endif Not sure if it is worth, though. -- Mindaugas
Re: cprng_fast implementation benchmarks
On Apr 22, 2014, at 11:59 PM, Thor Lancelot Simon t...@panix.com wrote: ... RESULTS kernelcpb (32 bit)4GB (1 way) 16GB (4 ways) Scaling Factor ----- - -- arc4-mtx 35 42.58 398.83 0.106 arc4-nomtx24 42.12 2338.92 0.018 arc4-percpu 27 33.63 41.59 0.808 hc128-percpu 21 23.75 34.90 0.680 hc128-inline 19 22.66 31.75 0.713 chacha8 22 20.51 30.45 0.662 chacha12 24 24.87 34.32 0.724 chacha20 28 30.45 39.28 0.775 I believe ChaCha8 is suitable for our purpose: we were previously considering ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit strength against the best currently known attack (at present, there is no attack better than brute force on ChaCha8, and the best attack on ChaCha7 is 2^248). ChaCha8 appears to be somewhat faster than the old arc4 implementation. I’ve been watching this long stream of messages flying by, and I’m a bit concerned about the approach. As I understand it, there is a strong RNG, based on RC4 (“ARC4”) in the kernel today. It is used by some things that require strong random numbers, and also by things that don’t (or at most, have weaker requirements than the cryptographic operations that require serious strength). It isn’t clear to me what fraction of the workload really requires the cryptographically strong generator. Replacing the existing strong generator by a new one that is faster is tempting. The question is: how much confidence do you need in the new algorithm? For things like port randomization, not much. One might argue that a PRNG is fine for that. On the other hand, for those spots where cryptographic random numbers are required, a lot. It isn’t at all clear to me that the proposed replacements are sufficiently well analyzed. If someone were to propose a replacement whose security is demonstrably that of the AES block cipher, I’d be a whole lot more comfortable, because AES is one of the very few cryptosystems out there that has had significant analysis. (RC4 may be another, though in that case there have been some results that raise concern.) It would probably also be useful to identify the various uses of kernel RNGs, and document clearly what their security requirements (if any) are. paul
Re: cprng_fast implementation benchmarks
On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote: On Wed, Apr 23, 2014 at 10:57:59AM +0200, Joerg Sonnenberger wrote: On Tue, Apr 22, 2014 at 11:59:38PM -0400, Thor Lancelot Simon wrote: I believe ChaCha8 is suitable for our purpose: we were previously considering ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit strength against the best currently known attack (at present, there is no attack better than brute force on ChaCha8, and the best attack on ChaCha7 is 2^248). ChaCha8 appears to be somewhat faster than the old arc4 implementation. Sounds wrong. When I measured Salsa20/8, it was ~3 times faster than RC4. Code can be found at http://www.netbsd.org/~joerg/arc4random_salsa.c. That's a libc implementation -- and were you calling it for 32 bits at a time, or bulk data? I measured either case, extracting 32bits at a time vs doing larger operations. Joerg
Re: cprng_fast implementation benchmarks
On Wed, Apr 23, 2014 at 03:30:09PM +0200, Manuel Bouyer wrote: On Wed, Apr 23, 2014 at 09:16:33AM -0400, Thor Lancelot Simon wrote: [...] Do we still have a compile-time way to check if the kernel (or port) is uniprocessor only? If so we should probably #ifdef away the percpu calls in such kernels, which are probably for slower hardware anyway. AFAIK options MULTIPROCESSOR is still here Do the percpu() calls collapse out for non-MULTIPROCESSOR kernels? In any case you'd want to do what is done with some of the mutex code. ie overwrite the code of the SMP version with that of the uniprocessor one if the current system only has one cpu. David -- David Laight: da...@l8s.co.uk
Re: cprng_fast implementation benchmarks
On Wed, Apr 23, 2014 at 02:21:31PM +, paul_kon...@dell.com wrote: I?ve been watching this long stream of messages flying by, and I?m a bit concerned about the approach. As I understand it, there is a strong RNG, based on RC4 (?ARC4?) in the kernel today. No. There is a strong RNG, based on the NIST SP800-90 CTR_DRBG with AES128 as the block transform. There is also a fast RNG, based on RC4. We are discussing the replacement of the fast RNG. Thor
Re: cprng_fast implementation benchmarks
Date: Sun, 20 Apr 2014 03:18:03 -0400 From: Thor Lancelot Simon t...@panix.com The following tests were run: 1) Cycles-per-byte, 32 bit request: at initial keying of the generator and each automatic rekeying. 2) Generate 4GB of data by requesting 256 bytes at a time from userspace using a small C program that loops around the kern.arandom sysctl call. Values are in seconds. 3) Generate 16GB of data by running 4 copies of the generator program in parallel. Values are in seconds. It's good to confirm by these tests that going per-CPU didn't hurt single-threaded bulk throughput and improved scalability -- so much that just about any halfway reasonable choice of stream cipher would do better than what we have now or had before. What these tests don't measure is the impact on overall system performance, especially via the CPU cache. I don't expect going per-CPU to hurt cachewise, but the difference between RC4's 256-byte state and HC-128's 4 KB state may be noticeable in, say, build.sh, or perhaps high network load triggering cprng_fasts in ALTQ or IPsec or something. Let's test SALSA20 and discuss further. I wrote a little code to implement the cprng_fast API using Salsa20 or ChaCha with buffer for a single cipher core output. The code, along with a little driver to measure single-threaded speed and test vectors to ensure we are computing Salsa20 and ChaCha, is temporarily at https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/. The entire state fits in two 64-byte cache lines -- half what RC4 took. On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk, average of ~50 cycles for each cprng_fast32, and maximum of ~300 cycles. (If you need help reproducing my results, let me know. This is all a userland mockup, of course, but adapting it to the real kernel cprng_fast shouldn't be hard.) This implementation also has the property that the maximum work it does with interrupts blocked is a single cipher core and copying at most 64 bytes, plus some additions/subtractions to count things and maybe schedule a softint. It services long requests by grabbing a temporary key from the buffer while interrupts are blocked, and then generating the rest from the temporary key, which works well because Salsa20 and ChaCha have zero key setup cost. We should run some of these benchmarks on big-endian systems and systems that are architecturally unlike the OOO x86-64 I used for my tests. If I find time I can do some tests on some littleish (by modern standards) arm and powerpc systems, but I'll have to defer to others for the VAXen and toasters.
Re: cprng_fast implementation benchmarks
On 20 Apr, 2014, at 00:18 , Thor Lancelot Simon t...@panix.com wrote: Personally, I'd go with the HC128 variant without the ugly inlining. We restrict bytes-on-key to 512MB -- the best current attack on HC128 requires 2^65 bytes to distinguish its keystream, so I think it's a reasonable choice at this time. Percpu ARC4 is really a strawman; RC4 is just not safe at any speed any longer. Let's test SALSA20 and discuss further. The thing that bothers me about HC-128 is that its output is apparently not very random. This paper http://viper.eng.iastate.edu/gmdev/pubs/tantra.pdf shows the number of failures it gets when run through the TestU01 batteries. While one would need to see the p-values of the failures to know if these are kind of bad or really bad, the fact that even the Small Crush battery is finding stuff (that battery takes under 10 seconds to run) is making this seem, well, not good. I personally might overlook this if there were some offsetting advantages to HC-128, but the fact that the mediocre random numbers it produces are being paid for with 4500 bytes of context is making me think it can't be a good idea. When Salsa20/ChaCha was mentioned I turned ChaCha into a PRNG library, available here http://www.netbsd.org/~dennis/ccrand.tgz if it is useful, and ran the three variations through TestU01 batteries. ChaCha8 produced two failures, one in Crush (about half an hour of CPU time to run) and one in Big Crush (about 3 hours). ChaCha12 produced a single failure, in Crush. I fumble-fingered running ChaCha20 and had to redo that, but it has now run everything except the last half hour of Big Crush with no failures so far. For comparison, arc4random() gets a couple of failures in Big Crush, the Mersenne Twister (which had a good reputation as a fast, high quality non-cryptographic PRNG) also gets a couple failures and ISAAC, my prior favorite PRNG, runs the entire set of batteries clean even though it uses less than half the state space of HC-128 (AES and SHA-1 also run clean, but more slowly). Given that ChaCha seems to be keeping pretty good company in terms of random number quality, and manages to do this securely as well with only 128 bytes of state (i.e. 40 ChaCha instances consume the same state space as one HC-128 instance), I am thinking that this seems like a pretty good idea even if it turns out that ChaCha20 isn't as easy on the CPU as one would prefer. The reason I think minimizing the amount of state an instance of the PRNG requires is important derives from something clearly demonstrated by your measurements, that often the cost of computing something from shared data can be hugely dominated not by the computation but rather by the sharing, in terms the cost of taking locks or in having that data frequently invalidated from your cache or whatever else you do to ensure mutual exclusion. I think the approach you are taking, of maintaining more instances of the data which are shared much less widely, is exactly right, but I don't think per-CPU plus splhigh() is a recipe that works long term. It is correct to point out that the random number generator won't be doing much at splhigh(), but what is incorrect is to assume that the random number generator is the only thing that will need a solution to this problem. If per-CPU+splhigh() is the solution to all such problems then aggregate may end up running at splhigh() a lot, even if each individual application doesn't do much, and I don't think that is a good outcome. So I think it might be useful if someone eventually figured out how to eliminate the splhigh() in favor of even more instances shared even less widely (but more compatibly), say per-CPU, per-interrupt-level or even per-application. If it turns out that having a large number of random number instances is the best arrangement then the cost of having a large number of instances becomes relevant. ChaCha's 128 bytes of state per instance makes it highly desirable, while HC-128's 4500 bytes per instance seems a lot to pay for not very random numbers even if it computes them quickly in benchmarks. Dennis Ferguson