Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On May 4, 2016 2:42:53 PM PDT, John Denker wrote:
>On 05/04/2016 12:07 PM, ty...@thunk.org wrote:
>
>> it doesn't hit the
>> UB case which Jeffrey was concerned about.
>
>That should be good enough for present purposes.
>
>However, in the interests of long-term maintainability, I
>would suggest sticking in a comment or assertion saying
>that ror32(,shift) is never called with shift=0.  This
>can be removed if/when bitops.h is upgraded.
>
>There is a track record of compilers doing Bad Things in
>response to UB code, including some very counterintuitive
>Bad Things.
>
>On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
>>>
>>> If bitops.h doesn't do the right thing, we need to
>>> fix bitops.h.
>
>Most of the ror and rol functions in linux/bitops.h
>should be considered unsafe, as currently implemented.
>http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/bitops.h?id=04974df8049fc4240d22759a91e035082ccd18b4#n103
>
>I don't see anything in the file that suggests any limits
>on the range of the second argument.  So at best it is an
>undocumented trap for the unwary.  This has demonstrably
>been a problem in the past.  The explanation in the attached
>fix-rol32.diff makes amusing reading.
>
>Of the eight functions
>  ror64, rol64, ror32, rol32, ror16, rol16, ror8, and rol8,
>only one of them can handle shifting by zero, namely rol32.
>It was upgraded on Thu Dec 3 22:04:01 2015; see the attached
>fix-rol32.diff.
>
>I find it very odd that the other seven functions were not
>upgraded.  I suggest the attached fix-others.diff would make
>things more consistent.
>
>Beware that shifting by an amount >= the number of bits in the
>word remains Undefined Behavior.  This should be either documented
>or fixed.

It could be fixed easily enough.  This construct has been supported as a rotate since at least gcc2.

-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
-- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On 05/04/2016 12:07 PM, ty...@thunk.org wrote:

> it doesn't hit the
> UB case which Jeffrey was concerned about.

That should be good enough for present purposes.

However, in the interests of long-term maintainability, I would suggest
sticking in a comment or assertion saying that ror32(,shift) is never
called with shift=0.  This can be removed if/when bitops.h is upgraded.

There is a track record of compilers doing Bad Things in response to UB
code, including some very counterintuitive Bad Things.

On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:

>> If bitops.h doesn't do the right thing, we need to
>> fix bitops.h.

Most of the ror and rol functions in linux/bitops.h should be
considered unsafe, as currently implemented.
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/bitops.h?id=04974df8049fc4240d22759a91e035082ccd18b4#n103

I don't see anything in the file that suggests any limits on the range
of the second argument.  So at best it is an undocumented trap for the
unwary.  This has demonstrably been a problem in the past.  The
explanation in the attached fix-rol32.diff makes amusing reading.

Of the eight functions
  ror64, rol64, ror32, rol32, ror16, rol16, ror8, and rol8,
only one of them can handle shifting by zero, namely rol32.  It was
upgraded on Thu Dec 3 22:04:01 2015; see the attached fix-rol32.diff.

I find it very odd that the other seven functions were not upgraded.
I suggest the attached fix-others.diff would make things more
consistent.

Beware that shifting by an amount >= the number of bits in the word
remains Undefined Behavior.  This should be either documented or fixed.
It could be fixed easily enough.

commit d7e35dfa2531b53618b9e6edcd8752ce988ac555
Author: Sasha Levin
Date:   Thu Dec 3 22:04:01 2015 -0500

    bitops.h: correctly handle rol32 with 0 byte shift

    ROL on a 32 bit integer with a shift of 32 or more is undefined
    and the result is arch-dependent.
    Avoid this by handling the trivial case of rolling by 0 correctly.

    The trivial solution of checking if shift is 0 breaks gcc's
    detection of this code as a ROL instruction, which is unacceptable.

    This bug was reported and fixed in GCC
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):

	The standard rotate idiom,

	  (x << n) | (x >> (32 - n))

	is recognized by gcc (for concreteness, I discuss only the case
	that x is an uint32_t here).

	However, this is portable C only for n in the range 0 < n < 32.
	For n == 0, we get x >> 32 which gives undefined behaviour
	according to the C standard (6.5.7, Bitwise shift operators).
	To portably support n == 0, one has to write the rotate as
	something like

	  (x << n) | (x >> ((-n) & 31))

	And this is apparently not recognized by gcc.

    Note that this is broken on older GCCs and will result in slower
    ROL.

    Acked-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Linus Torvalds

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2b8ed12..defeaac 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -107,7 +107,7 @@ static inline __u64 ror64(__u64 word, unsigned int shift)
  */
 static inline __u32 rol32(__u32 word, unsigned int shift)
 {
-	return (word << shift) | (word >> (32 - shift));
+	return (word << shift) | (word >> ((-shift) & 31));
 }
 
 /**

commit 03b97eeb5401ede1e1d7b6fbf6a9575db8d0efa6
Author: John Denker
Date:   Wed May 4 13:55:51 2016 -0700

    Make ror64, rol64, ror32, ror16, rol16, ror8, and rol8 consistent
    with rol32 in their handling of shifting by a zero amount.

    Same overall rationale as in d7e35dfa, just more consistently
    applied.

    Beware that shifting by an amount >= the number of bits in the
    word remains Undefined Behavior.  This should be either documented
    or fixed.  It could be fixed easily enough.
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index defeaac..97096b4 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -87,7 +87,7 @@ static __always_inline unsigned long hweight_long(unsigned long w)
  */
 static inline __u64 rol64(__u64 word, unsigned int shift)
 {
-	return (word << shift) | (word >> (64 - shift));
+	return (word << shift) | (word >> ((-shift) & 63));
 }
 
 /**
@@ -97,7 +97,7 @@ static inline __u64 rol64(__u64 word, unsigned int shift)
  */
 static inline __u64 ror64(__u64 word, unsigned int shift)
 {
-	return (word >> shift) | (word << (64 - shift));
+	return (word >> shift) | (word << ((-shift) & 63));
 }
 
 /**
@@ -117,7 +117,7 @@ static inline __u32 rol32(__u32 word, unsigned int shift)
  */
 static inline __u32 ror32(__u32 word, unsigned int shift)
 {
-	return (word >> shift) | (word << (32 - shift));
+	return (word >> shift) | (word << ((-shift) & 31));
 }
 
 /**
@@ -127,7 +127,7 @@ static inline __u32 ror32(__u32 word, unsigned int shift)
  */
 sta
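[Editor's note] The masked-rotate idiom in the diffs above can be sketched in plain, self-contained C. This is a hedged sketch, not the kernel's code: the function names with the `_safe` suffix are illustrative, and the 64-bit variants must mask with 63 (the word size minus one), since masking with 64 would reduce every shift count to 0 or 64. Shifting by an amount >= the word size remains undefined behavior, as the thread notes.

```c
#include <stdint.h>

/* Rotate helpers using the (-shift) & (bits - 1) mask.  This is
 * well-defined for shift == 0 (both sub-shifts become 0) and is
 * recognized by gcc/clang as a single rotate instruction. */
static inline uint32_t rol32_safe(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> ((-shift) & 31));
}

static inline uint32_t ror32_safe(uint32_t word, unsigned int shift)
{
	return (word >> shift) | (word << ((-shift) & 31));
}

static inline uint64_t rol64_safe(uint64_t word, unsigned int shift)
{
	/* note: the mask is 63, not 64 */
	return (word << shift) | (word >> ((-shift) & 63));
}
```

The naive `32 - shift` form invokes undefined behavior at shift == 0 because `word >> 32` is out of range; the masked form turns that case into `word >> 0` while leaving all other shift counts unchanged.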
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On 05/04/2016 12:07 PM, ty...@thunk.org wrote:
>
> If we are all agreed that what is in bitops.h is considered valid,
> then we can start converting people over to using the version defined
> in bitops.h, and if there is some compiler issue we need to work
> around, at least we only need to put the workaround in a single header
> file.
>

Yes, please.

	-hpa
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
>
> We don't care about UB, we care about gcc, and to a lesser extent
> LLVM and ICC.  If bitops.h doesn't do the right thing, we need to
> fix bitops.h.

I'm going to suggest that we treat the ro[rl]{32,64}() question as
separable from the /dev/random case.  I've sanity checked gcc 5.3.1 and
it does the right thing given the small number of constant arguments
given to rotl32() in lib/chacha20.c, and it doesn't hit the UB case
which Jeffrey was concerned about.

This is also code that was previously in crypto/chacha20_generic.c, and
so if there is a bug with some obscure compiler not properly compiling
it down to a rotate instruction, (a) no one is paying me to make sure
the kernel code compiles well on ICC, and (b) it's not scalable to have
each kernel developer try to deal with the vagaries of compilers.

So from my perspective, the only interesting options are (a) using what
had been used before in crypto/chacha20_generic.c, or (b) reusing what
is in bitops.h and letting it be someone else's problem if some obscure
compiler isn't happy with what is in bitops.h.

If we are all agreed that what is in bitops.h is considered valid, then
we can start converting people over to using the version defined in
bitops.h, and if there is some compiler issue we need to work around,
at least we only need to put the workaround in a single header file.

Cheers,

	- Ted
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On May 4, 2016 11:22:25 AM PDT, Jeffrey Walton wrote:
>On Wed, May 4, 2016 at 1:49 PM,  wrote:
>> On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
>>> > +static inline u32 rotl32(u32 v, u8 n)
>>> > +{
>>> > +	return (v << n) | (v >> (sizeof(v) * 8 - n));
>>> > +}
>>>
>>> That's undefined behavior when n=0.
>>
>> Sure, but it's never called with n = 0; I've double checked and the
>> compiler seems to do the right thing with the above pattern as well.
>
>> Hmm, it looks like there is a "standard" version of rotate left and
>> right defined in include/linux/bitops.h.  So I suspect it would make
>> sense to use rol32 as defined in bitops.h --- and this is probably
>> something
>
>bitops.h could work in this case, but it's not an ideal solution. GCC
>does not optimize the code below as expected under all use cases
>because GCC does not recognize it as a rotate (see
>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):
>
>  return (v << n) | (v >> (sizeof(v) * 8 - n));
>
>And outside of special cases like Salsa, ChaCha and BLAKE2, the code
>provided in bitops.h suffers UB on arbitrary data. So I think care
>needs to be taken when selecting functions from bitops.h.
>
>> that we should do for the rest of crypto/*.c, where people seem to be
>> defining their own version of something like rotl32 (I copied the
>> contents of crypto/chacha20_generic.c to lib/chacha20, so this
>> pattern of defining one's own version of rol32 isn't new).
>
>Yeah, I kind of thought there was some duplication going on.
>
>But I think bitops.h should be fixed. Many folks don't realize the
>lurking UB, and many folks don't realize it's not always optimized
>well.
>
>>> I think the portable way to do a rotate that avoids UB is the
>>> following. GCC, Clang and ICC recognize the pattern, and emit a
>>> rotate instruction.
>>>
>>>   static const unsigned int MASK=31;
>>>   return (v<<n)|(v>>(-n&MASK));
>>>
>>> You should also avoid the following because it's not constant time
>>> due to the branch:
>>>
>>>   return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
>>>
>>
>> Where is this coming from?  I don't see this construct in the patch.
>
>My bad... It was a general observation. I've seen folks try to correct
>the UB by turning to something like that.
>
>Jeff

We don't care about UB, we care about gcc, and to a lesser extent LLVM
and ICC.  If bitops.h doesn't do the right thing, we need to fix
bitops.h.
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Wed, May 4, 2016 at 1:49 PM,  wrote:
> On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
>> > +static inline u32 rotl32(u32 v, u8 n)
>> > +{
>> > +	return (v << n) | (v >> (sizeof(v) * 8 - n));
>> > +}
>>
>> That's undefined behavior when n=0.
>
> Sure, but it's never called with n = 0; I've double checked and the
> compiler seems to do the right thing with the above pattern as well.
>
> Hmm, it looks like there is a "standard" version of rotate left and
> right defined in include/linux/bitops.h.  So I suspect it would make
> sense to use rol32 as defined in bitops.h --- and this is probably
> something

bitops.h could work in this case, but it's not an ideal solution. GCC
does not optimize the code below as expected under all use cases
because GCC does not recognize it as a rotate (see
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):

  return (v << n) | (v >> (sizeof(v) * 8 - n));

And outside of special cases like Salsa, ChaCha and BLAKE2, the code
provided in bitops.h suffers UB on arbitrary data. So I think care
needs to be taken when selecting functions from bitops.h.

> that we should do for the rest of crypto/*.c, where people seem to be
> defining their own version of something like rotl32 (I copied the
> contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
> of defining one's own version of rol32 isn't new).

Yeah, I kind of thought there was some duplication going on.

But I think bitops.h should be fixed. Many folks don't realize the
lurking UB, and many folks don't realize it's not always optimized
well.

>> I think the portable way to do a rotate that avoids UB is the
>> following. GCC, Clang and ICC recognize the pattern, and emit a
>> rotate instruction.
>>
>>   static const unsigned int MASK=31;
>>   return (v<<n)|(v>>(-n&MASK));
>>
>> You should also avoid the following because it's not constant time
>> due to the branch:
>>
>>   return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
>>
>
> Where is this coming from?  I don't see this construct in the patch.

My bad... It was a general observation. I've seen folks try to correct
the UB by turning to something like that.

Jeff
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On May 4, 2016 10:30:41 AM PDT, ty...@mit.edu wrote:
>On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
>> > +/*
>> > + * crng_init = 0 --> Uninitialized
>> > + *             2 --> Initialized
>> > + *             3 --> Initialized from input_pool
>> > + */
>> > +static int crng_init = 0;
>>
>> shouldn't that be an atomic_t ?
>
>The crng_init variable only gets incremented (it progresses from
>0->1->2->3) and it's protected by the primary_crng->lock spinlock.
>There are a few places where we read it without first taking the lock,
>but that's where we depend on it getting incremented and where if we
>race with another CPU which has just bumped the crng_init status, it's
>not critical.  (Or we can take the lock and then recheck the crng_init
>if we really need to be sure.)
>
>> > +static void _initialize_crng(struct crng_state *crng)
>> > +{
>> > +	int i;
>> > +	unsigned long rv;
>>
>> Why do you use unsigned long here? I thought the state[i] is
>> unsigned int.
>
>Because it gets filled in via arch_get_random_seed_long(&rv) and
>arch_get_random_long(&rv).  If that means we get 64 bits and we only
>use 32 bits, that's OK.
>
>> Would it make sense to add the ChaCha20 self test vectors from
>> RFC7539 here to test that the ChaCha20 works?
>
>We're not doing that for any of the other ciphers and hashes.  We
>didn't do that for SHA1, and the chacha20 code where I took this from
>didn't check for this as well.  What threat are you most worried
>about?  We don't care about chacha20 being exactly chacha20, so long
>as it's cryptographically strong.  In fact I think I removed a
>potential host order byteswap in the set key operation specifically
>because we don't care about interop.
>
>If this is required for FIPS mode, we can add that later.  I was
>primarily trying to keep the patch small to be easier for people to
>look at it, so I've deliberately left off bells and whistles that
>aren't strictly needed to show that the new approach is sound.
>
>> > +	if (crng_init++ >= 2)
>> > +		wake_up_interruptible(&crng_init_wait);
>>
>> Don't we have a race here with the crng_init < 3 check in
>> crng_reseed considering multi-core systems?
>
>No, because we are holding the primary_crng->lock spinlock.  I'll add
>a comment explaining the locking protections which is intended for
>crng_init where we declare it.
>
>> > +	if (num < 16 || num > 32) {
>> > +		WARN_ON(1);
>> > +		pr_err("crng_reseed: num is %d?!?\n", num);
>> > +	}
>> > +	num_words = (num + 3) / 4;
>> > +	for (i = 0; i < num_words; i++)
>> > +		primary_crng.state[i+4] ^= tmp[i];
>> > +	primary_crng.init_time = jiffies;
>> > +	if (crng_init < 3) {
>>
>> Shouldn't that one be if (crng_init < 3 && num >= 16) ?
>
>I'll just change the above WARN_ON test to be:
>
>	BUG_ON(num < 16 || num > 32);
>
>This really can't happen, and I had it as a WARN_ON with a printk for
>debugging purposes in case I was wrong about how the code works.
>
>> > +		crng_init = 3;
>> > +		process_random_ready_list();
>> > +		wake_up_interruptible(&crng_init_wait);
>> > +		pr_notice("random: crng_init 3\n");
>>
>> Would it make sense to be more descriptive here to allow readers of
>> dmesg to understand the output?
>
>Yes, what we're currently printing during the initialization:
>
>random: crng_init 1
>random: crng_init 2
>random: crng_init 3
>
>was again mostly for debugging purposes.  What I'm thinking about
>doing is changing the crng_init 2 and 3 messages to be:
>
>random: crng fast init done
>random: crng conservative init done
>
>> > +	}
>> > +	ret = 1;
>> > +out:
>> > +	spin_unlock_irqrestore(&primary_crng.lock, flags);
>>
>> memzero_explicit of tmp?
>
>Good point, I've added a call to memzero_explicit().
>
>> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
>> > +{
>> > +	unsigned long v, flags;
>> > +	struct crng_state *crng = &primary_crng;
>> > +
>> > +	if (crng_init > 2 &&
>> > +	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
>> > +		crng_reseed(&input_pool);
>> > +	spin_lock_irqsave(&crng->lock, flags);
>> > +	if (arch_get_random_long(&v))
>> > +		crng->state[14] ^= v;
>> > +	chacha20_block(&crng->state[0], out);
>> > +	if (crng->state[12] == 0)
>> > +		crng->state[13]++;
>
>> What is the purpose to only cover the 2nd 32 bit value of the nonce
>> with arch_get_random?
>>
>> state[12]++? Or why do you increment the nonce?
>
>In Chacha20, its output is a function of the state array, where
>state[0..3] is a constant specified by the Chacha20 definition,
>state[4..11] is the Key, and state[12..15] is the IV.  The
>chacha20_block() function increments state[12] each time it is called,
>so state[12] is being used as the block counter.  The increment of
>state[13] is used to make state[13] be the upper 32 bits of the block
>counter.  We *should* be reseeding more often than 2**32 calls
>to chacha20_block
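[Editor's note] The state layout Ted describes — constants in state[0..3], key in state[4..11], IV/counter in state[12..15] — can be sketched as follows. This is an illustrative sketch, not kernel code: the function name `chacha20_init_state` and its parameters are hypothetical; only the layout and the "expand 32-byte k" constants come from the ChaCha20 definition.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the ChaCha20 state layout described above. */
static void chacha20_init_state(uint32_t state[16],
				const uint32_t key[8],
				const uint32_t iv[4])
{
	/* state[0..3]: the "expand 32-byte k" constants */
	state[0] = 0x61707865;
	state[1] = 0x3320646e;
	state[2] = 0x79622d32;
	state[3] = 0x6b206574;
	/* state[4..11]: the 256-bit key */
	memcpy(&state[4], key, 8 * sizeof(uint32_t));
	/* state[12..15]: block counter and nonce; in the scheme discussed
	 * in this thread, state[12] is the low word of the counter and
	 * state[13] serves as its upper 32 bits */
	memcpy(&state[12], iv, 4 * sizeof(uint32_t));
}
```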
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote: > > +static inline u32 rotl32(u32 v, u8 n) > > +{ > > + return (v << n) | (v >> (sizeof(v) * 8 - n)); > > +} > > That's undefined behavior when n=0. Sure, but it's never called with n = 0; I've double checked and the compiler seems to do the right thing with the above pattern as well. Hmm, it looks like there is a "standard" version rotate left and right defined in include/linux/bitops.h. So I suspect it would make sense to use rol32 as defined in bitops.h --- and this is probably something that we should do for the rest of crypto/*.c, where people seem to be defininig their own version of something like rotl32 (I copied the contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern of defining one's own version of rol32 isn't new). > I think the portable way to do a rotate that avoids UB is the > following. GCC, Clang and ICC recognize the pattern, and emit a rotate > instruction. > > static const unsigned int MASK=31; > return (v<>(-n&MASK)); > > You should also avoid the following because its not constant time due > to the branch: > > return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n)); > Where is this coming from? I don't see this construct in the patch. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
> > +/*
> > + * crng_init = 0 --> Uninitialized
> > + *             2 --> Initialized
> > + *             3 --> Initialized from input_pool
> > + */
> > +static int crng_init = 0;
>
> shouldn't that be an atomic_t ?

The crng_init variable only gets incremented (it progresses from
0->1->2->3) and it's protected by the primary_crng->lock spinlock.
There are a few places where we read it without first taking the lock,
but that's where we depend on it getting incremented and where if we
race with another CPU which has just bumped the crng_init status, it's
not critical.  (Or we can take the lock and then recheck the crng_init
if we really need to be sure.)

> > +static void _initialize_crng(struct crng_state *crng)
> > +{
> > +	int i;
> > +	unsigned long rv;
>
> Why do you use unsigned long here? I thought the state[i] is unsigned
> int.

Because it gets filled in via arch_get_random_seed_long(&rv) and
arch_get_random_long(&rv).  If that means we get 64 bits and we only
use 32 bits, that's OK.

> Would it make sense to add the ChaCha20 self test vectors from
> RFC7539 here to test that the ChaCha20 works?

We're not doing that for any of the other ciphers and hashes.  We
didn't do that for SHA1, and the chacha20 code where I took this from
didn't check for this as well.  What threat are you most worried
about?  We don't care about chacha20 being exactly chacha20, so long
as it's cryptographically strong.  In fact I think I removed a
potential host order byteswap in the set key operation specifically
because we don't care about interop.

If this is required for FIPS mode, we can add that later.  I was
primarily trying to keep the patch small to be easier for people to
look at it, so I've deliberately left off bells and whistles that
aren't strictly needed to show that the new approach is sound.

> > +	if (crng_init++ >= 2)
> > +		wake_up_interruptible(&crng_init_wait);
>
> Don't we have a race here with the crng_init < 3 check in crng_reseed
> considering multi-core systems?

No, because we are holding the primary_crng->lock spinlock.  I'll add
a comment explaining the locking protections which is intended for
crng_init where we declare it.

> > +	if (num < 16 || num > 32) {
> > +		WARN_ON(1);
> > +		pr_err("crng_reseed: num is %d?!?\n", num);
> > +	}
> > +	num_words = (num + 3) / 4;
> > +	for (i = 0; i < num_words; i++)
> > +		primary_crng.state[i+4] ^= tmp[i];
> > +	primary_crng.init_time = jiffies;
> > +	if (crng_init < 3) {
>
> Shouldn't that one be if (crng_init < 3 && num >= 16) ?

I'll just change the above WARN_ON test to be:

	BUG_ON(num < 16 || num > 32);

This really can't happen, and I had it as a WARN_ON with a printk for
debugging purposes in case I was wrong about how the code works.

> > +		crng_init = 3;
> > +		process_random_ready_list();
> > +		wake_up_interruptible(&crng_init_wait);
> > +		pr_notice("random: crng_init 3\n");
>
> Would it make sense to be more descriptive here to allow readers of
> dmesg to understand the output?

Yes, what we're currently printing during the initialization:

random: crng_init 1
random: crng_init 2
random: crng_init 3

was again mostly for debugging purposes.  What I'm thinking about
doing is changing the crng_init 2 and 3 messages to be:

random: crng fast init done
random: crng conservative init done

> > +	}
> > +	ret = 1;
> > +out:
> > +	spin_unlock_irqrestore(&primary_crng.lock, flags);
>
> memzero_explicit of tmp?

Good point, I've added a call to memzero_explicit().

> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> > +{
> > +	unsigned long v, flags;
> > +	struct crng_state *crng = &primary_crng;
> > +
> > +	if (crng_init > 2 &&
> > +	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> > +		crng_reseed(&input_pool);
> > +	spin_lock_irqsave(&crng->lock, flags);
> > +	if (arch_get_random_long(&v))
> > +		crng->state[14] ^= v;
> > +	chacha20_block(&crng->state[0], out);
> > +	if (crng->state[12] == 0)
> > +		crng->state[13]++;

> What is the purpose to only cover the 2nd 32 bit value of the nonce
> with arch_get_random?
>
> state[12]++? Or why do you increment the nonce?

In Chacha20, its output is a function of the state array, where
state[0..3] is a constant specified by the Chacha20 definition,
state[4..11] is the Key, and state[12..15] is the IV.  The
chacha20_block() function increments state[12] each time it is called,
so state[12] is being used as the block counter.  The increment of
state[13] is used to make state[13] be the upper 32 bits of the block
counter.  We *should* be reseeding more often than 2**32 calls to
chacha20_block(), and the increment is just to be safe in case
something goes wrong and we're not reseeding.  We're using crng[14] to
contain the output of RDRAND, so this is where
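[Editor's note] The counter handling Ted describes — chacha20_block() bumping state[12], with the caller carrying into state[13] — makes state[12..13] behave together as a 64-bit block counter. A minimal sketch (the function name is hypothetical; the patch splits these two steps between chacha20_block() and extract_crng()):

```c
#include <stdint.h>

/* After each generated block, the low counter word is incremented;
 * propagating the carry makes state[12..13] a 64-bit block counter. */
static void advance_block_counter(uint32_t state[16])
{
	state[12]++;		/* done inside chacha20_block() in the patch */
	if (state[12] == 0)
		state[13]++;	/* carry, done by the caller after the block */
}
```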
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
>> +	chacha20_block(&crng->state[0], out);
>> +	if (crng->state[12] == 0)
>> +		crng->state[13]++;
>
> state[12]++? Or why do you increment the nonce?

In Bernstein's Salsa and ChaCha, the counter is 64-bit. It appears
ChaCha-TLS uses a 32-bit counter, and the other 32 bits are given to
the nonce.

Maybe the first question to ask is, what ChaCha is the kernel
providing?  If it's ChaCha-TLS, then the carry does not make a lot of
sense.

If the generator is limiting the amount of material under a given set
of security parameters (key and nonce), then the generator will likely
re-key itself long before the 256-GB induced wrap. In this case, it
does not matter which ChaCha the kernel is providing and the carry is
superfluous.

Jeff
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
> +static inline u32 rotl32(u32 v, u8 n)
> +{
> +	return (v << n) | (v >> (sizeof(v) * 8 - n));
> +}

That's undefined behavior when n=0.

I think the portable way to do a rotate that avoids UB is the
following. GCC, Clang and ICC recognize the pattern, and emit a rotate
instruction.

  static const unsigned int MASK=31;
  return (v<<n)|(v>>(-n&MASK));

You should also avoid the following because it's not constant time due
to the branch:

  return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));

Jeff
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
Am Dienstag, 3. Mai 2016, 11:36:12 schrieb Stephan Mueller:

Hi Ted,

> > +
> > +static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
> > +{
> > +	ssize_t ret = 0, i;
> > +	__u8 tmp[CHACHA20_BLOCK_SIZE];
> > +	int large_request = (nbytes > 256);
> > +
> > +	while (nbytes) {
> > +		if (large_request && need_resched()) {
> > +			if (signal_pending(current)) {
> > +				if (ret == 0)
> > +					ret = -ERESTARTSYS;
> > +				break;
> > +			}
> > +			schedule();
> > +		}
> > +
> > +		extract_crng(tmp);
> > +		i = min_t(int, nbytes, CHACHA20_BLOCK_SIZE);
> > +		if (copy_to_user(buf, tmp, i)) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		nbytes -= i;
> > +		buf += i;
> > +		ret += i;
> > +	}
> > +
> > +	/* Wipe data just written to memory */
> > +	memzero_explicit(tmp, sizeof(tmp));
>
> Would it make sense to add another chacha20_block() call here at the
> end?  Note, the one thing about the SP800-90A DRBG I really like is
> the enhanced backward secrecy support which is implemented by
> "updating" the internal state (the key / state) used for one or more
> random number generation rounds after one request for random numbers
> is satisfied.
>
> This means that even if the state becomes known or the subsequent
> caller manages to deduce the state of the RNG to some degree of
> confidence, he cannot backtrack the already generated random numbers.
>
> I see that the ChaCha20 RNG implicitly updates its state while it
> operates.  But for the last round of the RNG, there is no more
> shuffling of the internal state.  As one round is 64 bytes in size
> and many callers just want 16 or 32 bytes (as seen during testing), a
> lot of callers trigger only one round of the RNG.

After doing some performance tests, I see that we reach a performance
of north of 200 MB/s on my system (compare that to 12 MB/s for the
SHA-1 version).  Thus, I would assume adding another call to
chacha20_block should not hurt.
Ciao
Stephan
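[Editor's note] Stephan's suggestion — one extra chacha20_block() call after a request is served, folded back into the key words so a captured state cannot be rewound to earlier outputs — could look roughly like this. This is a hedged sketch under stated assumptions, not the kernel patch: `crng_backtrack_protect` and `rekey` are illustrative names, the ChaCha20 core is a self-contained reimplementation (byte-order handling omitted), and the patch under discussion does not include this step.

```c
#include <stdint.h>
#include <string.h>

/* Rotate with the masked shift so n == 0 stays well-defined. */
#define ROTL32(v, n) (((v) << (n)) | ((v) >> ((-(n)) & 31)))

/* One ChaCha20 quarter round. */
#define QR(a, b, c, d) (               \
	a += b, d = ROTL32(d ^ a, 16), \
	c += d, b = ROTL32(b ^ c, 12), \
	a += b, d = ROTL32(d ^ a, 8),  \
	c += d, b = ROTL32(b ^ c, 7))

/* Self-contained ChaCha20 block function: 20 rounds over a copy of
 * the state, feed-forward add, then bump the low counter word. */
static void chacha20_block(uint32_t state[16], uint32_t out[16])
{
	uint32_t x[16];
	int i;

	memcpy(x, state, sizeof(x));
	for (i = 0; i < 20; i += 2) {
		/* column rounds */
		QR(x[0], x[4], x[8],  x[12]);
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		/* diagonal rounds */
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}
	for (i = 0; i < 16; i++)
		out[i] = x[i] + state[i];
	state[12]++;	/* low word of the block counter */
}

/* Hypothetical anti-backtracking step in the spirit of Stephan's
 * suggestion: after a request is served, generate one extra block and
 * overwrite the key words state[4..11] with it, so a later compromise
 * of the state cannot recompute earlier outputs. */
static void crng_backtrack_protect(uint32_t state[16])
{
	uint32_t rekey[16];

	chacha20_block(state, rekey);
	memcpy(&state[4], rekey, 8 * sizeof(uint32_t));
	memset(rekey, 0, sizeof(rekey));  /* wipe, cf. memzero_explicit() */
}
```

Because the block function is one-way in the key words, overwriting state[4..11] with fresh output discards the information needed to rewind the generator, at the cost of one extra block computation per request — small next to the ~200 MB/s throughput Stephan measures above.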
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
Am Montag, 2. Mai 2016, 02:26:51 schrieb Theodore Ts'o:

Hi Theodore,

One more item.

> The CRNG is faster, and we don't pretend to track entropy usage in the
> CRNG any more.
>
> Signed-off-by: Theodore Ts'o
> ---
>  crypto/chacha20_generic.c |  61 --
>  drivers/char/random.c     | 282 ++
>  include/crypto/chacha20.h |   1 +
>  lib/Makefile              |   2 +-
>  lib/chacha20.c            |  79 +
>  5 files changed, 294 insertions(+), 131 deletions(-)
>  create mode 100644 lib/chacha20.c
>
> diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> index da9c899..1cab831 100644
> --- a/crypto/chacha20_generic.c
> +++ b/crypto/chacha20_generic.c
> @@ -15,72 +15,11 @@
>  #include
>  #include
>
> -static inline u32 rotl32(u32 v, u8 n)
> -{
> -	return (v << n) | (v >> (sizeof(v) * 8 - n));
> -}
> -
>  static inline u32 le32_to_cpuvp(const void *p)
>  {
>  	return le32_to_cpup(p);
>  }
>
> -static void chacha20_block(u32 *state, void *stream)
> -{
> -	u32 x[16], *out = stream;
> -	int i;
> -
> -	for (i = 0; i < ARRAY_SIZE(x); i++)
> -		x[i] = state[i];
> -
> -	for (i = 0; i < 20; i += 2) {
> -		x[0] += x[4];    x[12] = rotl32(x[12] ^ x[0], 16);
> -		x[1] += x[5];    x[13] = rotl32(x[13] ^ x[1], 16);
> -		x[2] += x[6];    x[14] = rotl32(x[14] ^ x[2], 16);
> -		x[3] += x[7];    x[15] = rotl32(x[15] ^ x[3], 16);
> -
> -		x[8] += x[12];   x[4] = rotl32(x[4] ^ x[8], 12);
> -		x[9] += x[13];   x[5] = rotl32(x[5] ^ x[9], 12);
> -		x[10] += x[14];  x[6] = rotl32(x[6] ^ x[10], 12);
> -		x[11] += x[15];  x[7] = rotl32(x[7] ^ x[11], 12);
> -
> -		x[0] += x[4];    x[12] = rotl32(x[12] ^ x[0], 8);
> -		x[1] += x[5];    x[13] = rotl32(x[13] ^ x[1], 8);
> -		x[2] += x[6];    x[14] = rotl32(x[14] ^ x[2], 8);
> -		x[3] += x[7];    x[15] = rotl32(x[15] ^ x[3], 8);
> -
> -		x[8] += x[12];   x[4] = rotl32(x[4] ^ x[8], 7);
> -		x[9] += x[13];   x[5] = rotl32(x[5] ^ x[9], 7);
> -		x[10] += x[14];  x[6] = rotl32(x[6] ^ x[10], 7);
> -		x[11] += x[15];  x[7] = rotl32(x[7] ^ x[11], 7);
> -
> -		x[0] += x[5];    x[15] = rotl32(x[15] ^ x[0], 16);
> -		x[1] += x[6];    x[12] = rotl32(x[12] ^ x[1], 16);
> -		x[2] += x[7];    x[13] = rotl32(x[13] ^ x[2], 16);
> -		x[3] += x[4];    x[14] = rotl32(x[14] ^ x[3], 16);
> -
> -		x[10] += x[15];  x[5] = rotl32(x[5] ^ x[10], 12);
> -		x[11] += x[12];  x[6] = rotl32(x[6] ^ x[11], 12);
> -		x[8] += x[13];   x[7] = rotl32(x[7] ^ x[8], 12);
> -		x[9] += x[14];   x[4] = rotl32(x[4] ^ x[9], 12);
> -
> -		x[0] += x[5];    x[15] = rotl32(x[15] ^ x[0], 8);
> -		x[1] += x[6];    x[12] = rotl32(x[12] ^ x[1], 8);
> -		x[2] += x[7];    x[13] = rotl32(x[13] ^ x[2], 8);
> -		x[3] += x[4];    x[14] = rotl32(x[14] ^ x[3], 8);
> -
> -		x[10] += x[15];  x[5] = rotl32(x[5] ^ x[10], 7);
> -		x[11] += x[12];  x[6] = rotl32(x[6] ^ x[11], 7);
> -		x[8] += x[13];   x[7] = rotl32(x[7] ^ x[8], 7);
> -		x[9] += x[14];   x[4] = rotl32(x[4] ^ x[9], 7);
> -	}
> -
> -	for (i = 0; i < ARRAY_SIZE(x); i++)
> -		out[i] = cpu_to_le32(x[i] + state[i]);
> -
> -	state[12]++;
> -}
> -
>  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
>  			     unsigned int bytes)
>  {
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b583e53..95f4451 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -260,6 +260,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -412,6 +413,15 @@ static struct fasync_struct *fasync;
>  static DEFINE_SPINLOCK(random_ready_list_lock);
>  static LIST_HEAD(random_ready_list);
>
> +/*
> + * crng_init = 0 --> Uninitialized
> + *             2 --> Initialized
> + *             3 --> Initialized from input_pool
> + */
> +static int crng_init = 0;
> +#define crng_ready() (likely(crng_init >= 2))
> +static void process_random_ready_list(void);
> +
>  /**
>  *
>  * OS independent entropy store.  Here are the functions which handle
> @@ -441,10 +451,13 @@ struct entropy_store {
>  	__u8 last_data[EXTRACT_SIZE];
>  };
>
> +static ssize_t extract_entropy(struct entropy_store *r, void *buf,
> +			       size_t nbytes, int min, int rsvd);
> +
> +static int crng_reseed(struct entropy_store *r);
>  static void push_to_pool(struct work_struct *work);
>  static __u32 input_pool_data[INPUT_POOL_WORDS];
>  static __u32 blocking_pool_data[OUTPUT_POOL_
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Monday, 2 May 2016 at 02:26:51, Theodore Ts'o wrote:

Hi Theodore,

> The CRNG is faster, and we don't pretend to track entropy usage in the
> CRNG any more.

In general, I have no concerns with this approach either, and thank you
for addressing some of my concerns. A few concerns are still open; I will
write them up together with a proposal on how to address them. Some
comments inline:

> Signed-off-by: Theodore Ts'o
> ---
>  crypto/chacha20_generic.c |  61 --
>  drivers/char/random.c     | 282 ++
>  include/crypto/chacha20.h |   1 +
>  lib/Makefile              |   2 +-
>  lib/chacha20.c            |  79 +
>  5 files changed, 294 insertions(+), 131 deletions(-)
>  create mode 100644 lib/chacha20.c
>
> diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> index da9c899..1cab831 100644
> --- a/crypto/chacha20_generic.c
> +++ b/crypto/chacha20_generic.c
> @@ -15,72 +15,11 @@
>  #include
>  #include
>
> -static inline u32 rotl32(u32 v, u8 n)
> -{
> -	return (v << n) | (v >> (sizeof(v) * 8 - n));
> -}
> -
>  static inline u32 le32_to_cpuvp(const void *p)
>  {
>  	return le32_to_cpup(p);
>  }
>
> -static void chacha20_block(u32 *state, void *stream)
> -{
> -	u32 x[16], *out = stream;
> -	int i;
> -
> -	for (i = 0; i < ARRAY_SIZE(x); i++)
> -		x[i] = state[i];
> -
> -	for (i = 0; i < 20; i += 2) {
> -		x[0]  += x[4];    x[12] = rotl32(x[12] ^ x[0],  16);
> -		x[1]  += x[5];    x[13] = rotl32(x[13] ^ x[1],  16);
> -		x[2]  += x[6];    x[14] = rotl32(x[14] ^ x[2],  16);
> -		x[3]  += x[7];    x[15] = rotl32(x[15] ^ x[3],  16);
> -
> -		x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],  12);
> -		x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],  12);
> -		x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10], 12);
> -		x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11], 12);
> -
> -		x[0]  += x[4];    x[12] = rotl32(x[12] ^ x[0],   8);
> -		x[1]  += x[5];    x[13] = rotl32(x[13] ^ x[1],   8);
> -		x[2]  += x[6];    x[14] = rotl32(x[14] ^ x[2],   8);
> -		x[3]  += x[7];    x[15] = rotl32(x[15] ^ x[3],   8);
> -
> -		x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],   7);
> -		x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],   7);
> -		x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10],  7);
> -		x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11],  7);
> -
> -		x[0]  += x[5];    x[15] = rotl32(x[15] ^ x[0],  16);
> -		x[1]  += x[6];    x[12] = rotl32(x[12] ^ x[1],  16);
> -		x[2]  += x[7];    x[13] = rotl32(x[13] ^ x[2],  16);
> -		x[3]  += x[4];    x[14] = rotl32(x[14] ^ x[3],  16);
> -
> -		x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10], 12);
> -		x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11], 12);
> -		x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],  12);
> -		x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],  12);
> -
> -		x[0]  += x[5];    x[15] = rotl32(x[15] ^ x[0],   8);
> -		x[1]  += x[6];    x[12] = rotl32(x[12] ^ x[1],   8);
> -		x[2]  += x[7];    x[13] = rotl32(x[13] ^ x[2],   8);
> -		x[3]  += x[4];    x[14] = rotl32(x[14] ^ x[3],   8);
> -
> -		x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10],  7);
> -		x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11],  7);
> -		x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],   7);
> -		x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],   7);
> -	}
> -
> -	for (i = 0; i < ARRAY_SIZE(x); i++)
> -		out[i] = cpu_to_le32(x[i] + state[i]);
> -
> -	state[12]++;
> -}
> -
>  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
>  			     unsigned int bytes)
>  {
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b583e53..95f4451 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -260,6 +260,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
>
> @@ -412,6 +413,15 @@ static struct fasync_struct *fasync;
>  static DEFINE_SPINLOCK(random_ready_list_lock);
>  static LIST_HEAD(random_ready_list);
>
> +/*
> + * crng_init = 0 --> Uninitialized
> + *             2 --> Initialized
> + *             3 --> Initialized from input_pool
> + */
> +static int crng_init = 0;

Shouldn't that be an atomic_t?

> +#define crng_ready() (likely(crng_init >= 2))
> +static void process_random_ready_list(void);
> +
>  /**
>   *
>   * OS independent entropy store.   Here are the functions which handle
> @@ -441,10 +451,13 @@ struct entropy_store {
>  	__u8 last_data[EXTRACT_SIZE];
>  };
>
> +static ssize_t extract_entropy(struct entropy_store *r, void *buf,