On Sat, Oct 10, 2015 at 02:45:46PM -0400, George Spelvin wrote: > In general, fewer larger pools is better for entropy storage. The only > reason for the current three-pool design is to achieve strict isolation > of the /dev/random pool.
You're absolutely right, and I would like to see if we can get away with keeping with a single pool design. Your idea of using the CPU id as a nonce is a good one, and I think we can do something similar that should be good enough for mixing the hash back into the pool. We can simply use a hash of the CPU id to change the offset where we do the mixing, and this should be good enough to avoid collisions when we do the add-back. HOWEVER. One downside of this approach is that, especially on a NUMA system, the costs of the cache coherency across a single pool which is constantly being modified and shared by all of the NUMA nodes could be a killer, even if we go completely lockless. To that end, Andi, can you try benchmarking the scalability of the patch below? I really hope it will be good enough, since besides using less memory, there are security advantages in not spreading the entropy across N pools. If it isn't, we might be able to play a trick where we sample the r->add_ptr value before we start hashing the pool, and then check to see if it's changed afterwards. If it has, we could skip doing the hash back, which could reduce the cache coherency traffic, since as you point out: > 2d. A more complete solution involves noting that, if there are multiple > concurrent readers, only one has to add back its output to prevent > backtracking, because all of the concurrent reads are equivalent > in that sense. (Even though, because they're salted with separate > nonces like the CPU id, they're not identical.) However, even if we put in that optimization, the primary question is how good is Intel's cache coherency protocols on their NUMA systems? I'm pretty sure this would be a disaster on, say, Sequent's old NUMA machines, but I'm quite confident all of those servers are dead and buried by now. :-) - Ted commit 3cb51896deab45bddc1b8f571b1103eae8f50e0e Author: Theodore Ts'o <ty...@mit.edu> Date: Sat Oct 10 22:03:53 2015 -0400 random: make the reading from the non-blocking pool more scalable Andi Kleen reported a case where a 4 socket system spent >80% of its total CPU time contending on the global urandom nonblocking pool spinlock. While the application could probably have used an own PRNG, it may have valid reasons to use the best possible key for different session keys. Instead of creating separate multiple per-NUMA node non-blocking pools, use a trick suggested by George Spelvin: Entropy is a holographic property of the pool, not located in any particular bit position. And the most basic operation, of reading from the pool, can easily be done by multiple readers at the same time from the same bit pattern. They just need to be salted with distinguishing nonces (CPU IDs will do nicely) to ensure they all get different results.... We use this trick, and in addition use a hash of the cpu id to change where we mix the hash back into the pool to avoid collisions. Since we are already using a lockless technique (cmpxchg) to update the entropy accounting, we don't need to change this around. This technique won't be quite as scalable since on a NUMA node we will still be forcing cache lines to bounce around, but from the perspective of entropy storage we're much better using a single pool rather than spreading it across multiple pools. Signed-off-by: Theodore Ts'o <ty...@mit.edu> diff --git a/drivers/char/random.c b/drivers/char/random.c index d0da5d8..be6b315 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -260,6 +260,7 @@ #include <linux/irq.h> #include <linux/syscalls.h> #include <linux/completion.h> +#include <linux/hash.h> #include <asm/processor.h> #include <asm/uaccess.h> @@ -494,7 +495,9 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in, { unsigned long i, tap1, tap2, tap3, tap4, tap5; int input_rotate; + int is_nonblock = (r == &nonblocking_pool); int wordmask = r->poolinfo->poolwords - 1; + int poolbits = r->poolinfo->poolbitshift - 5; const char *bytes = in; __u32 w; @@ -506,6 +509,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in, input_rotate = r->input_rotate; i = r->add_ptr; + if (is_nonblock) + i += hash_32(get_cpu(), poolbits); /* mix one byte at a time to simplify size handling and churn faster */ while (nbytes--) { @@ -534,6 +539,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in, r->input_rotate = input_rotate; r->add_ptr = i; + if (is_nonblock) + put_cpu(); } static void __mix_pool_bytes(struct entropy_store *r, const void *in, @@ -1090,6 +1097,7 @@ retry: static void extract_buf(struct entropy_store *r, __u8 *out) { int i; + int is_nonblock = (r == &nonblocking_pool); union { __u32 w[5]; unsigned long l[LONGS(20)]; @@ -1108,9 +1116,12 @@ static void extract_buf(struct entropy_store *r, __u8 *out) break; hash.l[i] = v; } + if (is_nonblock) + hash.w[0] ^= hash_32(get_cpu(), 32); + else + spin_lock_irqsave(&r->lock, flags); /* Generate a hash across the pool, 16 words (512 bits) at a time */ - spin_lock_irqsave(&r->lock, flags); for (i = 0; i < r->poolinfo->poolwords; i += 16) sha_transform(hash.w, (__u8 *)(r->pool + i), workspace); @@ -1124,7 +1135,10 @@ static void extract_buf(struct entropy_store *r, __u8 *out) * hash. */ __mix_pool_bytes(r, hash.w, sizeof(hash.w)); - spin_unlock_irqrestore(&r->lock, flags); + if (is_nonblock) + put_cpu(); + else + spin_unlock_irqrestore(&r->lock, flags); memzero_explicit(workspace, sizeof(workspace)); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/