Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread H. Peter Anvin
On May 4, 2016 2:42:53 PM PDT, John Denker  wrote:
>On 05/04/2016 12:07 PM, ty...@thunk.org wrote:
>
>> it doesn't hit the
>> UB case which Jeffrey was concerned about. 
>
>That should be good enough for present purposes.
>
>However, in the interests of long-term maintainability, I
>would suggest sticking in a comment or assertion saying 
>that ror32(,shift) is never called with shift=0.  This
>can be removed if/when bitops.h is upgraded.
>
>There is a track record of compilers doing Bad Things in
>response to UB code, including some very counterintuitive
>Bad Things.
>
>On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
>>>
>>> If bitops.h doesn't do the right thing, we need to
>>> fix bitops.h.
>
>Most of the ror and rol functions in linux/bitops.h
>should be considered unsafe, as currently implemented.
>http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/bitops.h?id=04974df8049fc4240d22759a91e035082ccd18b4#n103
>
>I don't see anything in the file that suggests any limits
>on the range of the second argument.  So at best it is an
>undocumented trap for the unwary.  This has demonstrably
>been a problem in the past.  The explanation in the attached
>fix-rol32.diff makes amusing reading.
>
>Of the eight functions
>  ror64, rol64, ror32, rol32, ror16, rol16, ror8, and rol8,
>only one of them can handle shifting by zero, namely rol32.
>It was upgraded on Thu Dec 3 22:04:01 2015; see the attached
>fix-rol32.diff.
>
>I find it very odd that the other seven functions were not
>upgraded.  I suggest the attached fix-others.diff would make
>things more consistent.
>
>Beware that shifting by an amount >= the number of bits in the
>word remains Undefined Behavior.  This should be either documented
>or fixed.  It could be fixed easily enough.

This construct has been supported as a rotate since at least gcc2.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread John Denker
On 05/04/2016 12:07 PM, ty...@thunk.org wrote:

> it doesn't hit the
> UB case which Jeffrey was concerned about. 

That should be good enough for present purposes.

However, in the interests of long-term maintainability, I
would suggest sticking in a comment or assertion saying 
that ror32(,shift) is never called with shift=0.  This
can be removed if/when bitops.h is upgraded.

There is a track record of compilers doing Bad Things in
response to UB code, including some very counterintuitive
Bad Things.

On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
>>
>> If bitops.h doesn't do the right thing, we need to
>> fix bitops.h.

Most of the ror and rol functions in linux/bitops.h
should be considered unsafe, as currently implemented.
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/bitops.h?id=04974df8049fc4240d22759a91e035082ccd18b4#n103

I don't see anything in the file that suggests any limits
on the range of the second argument.  So at best it is an
undocumented trap for the unwary.  This has demonstrably
been a problem in the past.  The explanation in the attached
fix-rol32.diff makes amusing reading.

Of the eight functions
  ror64, rol64, ror32, rol32, ror16, rol16, ror8, and rol8,
only one of them can handle shifting by zero, namely rol32.
It was upgraded on Thu Dec 3 22:04:01 2015; see the attached
fix-rol32.diff.

I find it very odd that the other seven functions were not
upgraded.  I suggest the attached fix-others.diff would make
things more consistent.

Beware that shifting by an amount >= the number of bits in the
word remains Undefined Behavior.  This should be either documented
or fixed.  It could be fixed easily enough.
commit d7e35dfa2531b53618b9e6edcd8752ce988ac555
Author: Sasha Levin 
Date:   Thu Dec 3 22:04:01 2015 -0500

bitops.h: correctly handle rol32 with 0 byte shift

ROL on a 32 bit integer with a shift of 32 or more is undefined and the
result is arch-dependent. Avoid this by handling the trivial case of
rotating by 0 correctly.

The trivial solution of checking if shift is 0 breaks gcc's detection
of this code as a ROL instruction, which is unacceptable.

This bug was reported and fixed in GCC
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):

	The standard rotate idiom,

	  (x << n) | (x >> (32 - n))

	is recognized by gcc (for concreteness, I discuss only the case that x
	is an uint32_t here).

	However, this is portable C only for n in the range 0 < n < 32. For n
	== 0, we get x >> 32 which gives undefined behaviour according to the
	C standard (6.5.7, Bitwise shift operators). To portably support n ==
	0, one has to write the rotate as something like

	  (x << n) | (x >> ((-n) & 31))

	And this is apparently not recognized by gcc.

Note that this is broken on older GCCs and will result in slower ROL.

Acked-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
Signed-off-by: Linus Torvalds 

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2b8ed12..defeaac 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -107,7 +107,7 @@ static inline __u64 ror64(__u64 word, unsigned int shift)
  */
 static inline __u32 rol32(__u32 word, unsigned int shift)
 {
-	return (word << shift) | (word >> (32 - shift));
+	return (word << shift) | (word >> ((-shift) & 31));
 }
 
 /**
commit 03b97eeb5401ede1e1d7b6fbf6a9575db8d0efa6
Author: John Denker 
Date:   Wed May 4 13:55:51 2016 -0700

Make ror64, rol64, ror32, ror16, rol16, ror8, and rol8
consistent with rol32 in their handling of shifting by a zero amount.

Same overall rationale as in d7e35dfa, just more consistently applied.

Beware that shifting by an amount >= the number of bits in the
word remains Undefined Behavior.  This should be either documented
or fixed.  It could be fixed easily enough.

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index defeaac..97096b4 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -87,7 +87,7 @@ static __always_inline unsigned long hweight_long(unsigned long w)
  */
 static inline __u64 rol64(__u64 word, unsigned int shift)
 {
-	return (word << shift) | (word >> (64 - shift));
+	return (word << shift) | (word >> ((-shift) & 63));
 }
 
 /**
@@ -97,7 +97,7 @@ static inline __u64 rol64(__u64 word, unsigned int shift)
  */
 static inline __u64 ror64(__u64 word, unsigned int shift)
 {
-	return (word >> shift) | (word << (64 - shift));
+	return (word >> shift) | (word << ((-shift) & 63));
 }
 
 /**
@@ -117,7 +117,7 @@ static inline __u32 rol32(__u32 word, unsigned int shift)
  */
 static inline __u32 ror32(__u32 word, unsigned int shift)
 {
-	return (word >> shift) | (word << (32 - shift));
+	return (word >> shift) | (word << ((-shift) & 31));
 }
 
 /**
@@ -127,7 +127,7 @@ static inline __u32 ror32(__u32 word, unsigned int shift)
  */
 sta

Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread H. Peter Anvin
On 05/04/2016 12:07 PM, ty...@thunk.org wrote:
> 
> If we are all agreed that what is in bitops.h is considered valid,
> then we can start converting people over to using the version defined
> in bitops.h, and if there is some compiler issue we need to work
> around, at least we only need to put the workaround in a single header
> file.
> 

Yes, please.

-hpa




Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
> 
> We don't care about UB, we care about gcc, and to a lesser extent
> LLVM and ICC.  If bitops.h doesn't do the right thing, we need to
> fix bitops.h.

I'm going to suggest that we treat the ro[rl]{32,64}() question as
separable from the /dev/random case.  I've sanity checked gcc 5.3.1
and it does the right thing given the small number of constant
arguments given to rotl32() in lib/chacha20.c, and it doesn't hit the
UB case which Jeffrey was concerned about.  This is also code that was
previously in crypto/chacha20_generic.c, and so if there is a bug with
some obscure compiler not properly compiling it down to a rotate
instruction, (a) no one is paying me to make sure the kernel code
compiles well on ICC, and (b) it's not scalable to have each kernel
developer try to deal with the vagaries of compilers.

So from my perspective, the only interesting results for me is (a)
using what had been used before with crypto/chacha20_generic.c, or (b)
reusing what is in bitops.h and letting it be someone else's problem
if some obscure compiler isn't happy with what is in bitops.h.

If we are all agreed that what is in bitops.h is considered valid,
then we can start converting people over to using the version defined
in bitops.h, and if there is some compiler issue we need to work
around, at least we only need to put the workaround in a single header
file.

Cheers,
- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread H. Peter Anvin
On May 4, 2016 11:22:25 AM PDT, Jeffrey Walton  wrote:
>On Wed, May 4, 2016 at 1:49 PM,   wrote:
>> On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
>>> > +static inline u32 rotl32(u32 v, u8 n)
>>> > +{
>>> > +   return (v << n) | (v >> (sizeof(v) * 8 - n));
>>> > +}
>>>
>>> That's undefined behavior when n=0.
>>
>> Sure, but it's never called with n = 0; I've double checked and the
>> compiler seems to do the right thing with the above pattern as well.
>
>> Hmm, it looks like there is a "standard" version of rotate left and right
>> defined in include/linux/bitops.h.  So I suspect it would make sense
>> to use rol32 as defined in bitops.h --- and this is probably something
>
>bitops.h could work in this case, but it's not an ideal solution. GCC
>does not optimize the code below as expected under all use cases
>because GCC does not recognize it as a rotate (see
>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):
>
>return (v << n) | (v >> (sizeof(v) * 8 - n));
>
>And outside of special cases like Salsa, ChaCha and BLAKE2, the code
>provided in bitops.h suffers UB on arbitrary data. So I think care
>needs to be taken when selecting functions from bitops.h.
>
>> that we should do for the rest of crypto/*.c, where people seem to be
>> defining their own version of something like rotl32 (I copied the
>> contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
>> of defining one's own version of rol32 isn't new).
>
>Yeah, I kind of thought there was some duplication going on.
>
>But I think bitops.h should be fixed. Many folks don't realize the
>lurking UB, and many folks don't realize it's not always optimized
>well.
>
>>> I think the portable way to do a rotate that avoids UB is the
>>> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
>>> instruction.
>>>
>>> static const unsigned int MASK=31;
>>> return (v<<n)|(v>>(-n&MASK));
>>>
>>> You should also avoid the following because it's not constant time due
>>> to the branch:
>>>
>>> return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
>>>
>>
>> Where is this coming from?  I don't see this construct in the patch.
>
>My bad... It was a general observation. I've seen folks try to correct
>the UB by turning to something like that.
>
>Jeff

We don't care about UB, we care about gcc, and to a lesser extent LLVM and ICC. 
 If bitops.h doesn't do the right thing, we need to fix bitops.h.


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread Jeffrey Walton
On Wed, May 4, 2016 at 1:49 PM,   wrote:
> On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
>> > +static inline u32 rotl32(u32 v, u8 n)
>> > +{
>> > +   return (v << n) | (v >> (sizeof(v) * 8 - n));
>> > +}
>>
>> That's undefined behavior when n=0.
>
> Sure, but it's never called with n = 0; I've double checked and the
> compiler seems to do the right thing with the above pattern as well.

> Hmm, it looks like there is a "standard" version of rotate left and right
> defined in include/linux/bitops.h.  So I suspect it would make sense
> to use rol32 as defined in bitops.h --- and this is probably something

bitops.h could work in this case, but it's not an ideal solution. GCC
does not optimize the code below as expected under all use cases
because GCC does not recognize it as a rotate (see
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157):

return (v << n) | (v >> (sizeof(v) * 8 - n));

And outside of special cases like Salsa, ChaCha and BLAKE2, the code
provided in bitops.h suffers UB on arbitrary data. So I think care
needs to be taken when selecting functions from bitops.h.

> that we should do for the rest of crypto/*.c, where people seem to be
> defining their own version of something like rotl32 (I copied the
> contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
> of defining one's own version of rol32 isn't new).

Yeah, I kind of thought there was some duplication going on.

But I think bitops.h should be fixed. Many folks don't realize the
lurking UB, and many folks don't realize it's not always optimized
well.

>> I think the portable way to do a rotate that avoids UB is the
>> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
>> instruction.
>>
>> static const unsigned int MASK=31;
>> return (v<<n)|(v>>(-n&MASK));
>>
>> You should also avoid the following because it's not constant time due
>> to the branch:
>>
>> return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
>>
>
> Where is this coming from?  I don't see this construct in the patch.

My bad... It was a general observation. I've seen folks try to correct
the UB by turning to something like that.

Jeff


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread H. Peter Anvin
On May 4, 2016 10:30:41 AM PDT, ty...@mit.edu wrote:
>On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
>> > +/*
>> > + * crng_init =  0 --> Uninitialized
>> > + *2 --> Initialized
>> > + *3 --> Initialized from input_pool
>> > + */
>> > +static int crng_init = 0;
>> 
>> shouldn't that be an atomic_t ?
>
>The crng_init variable only gets incremented (it progresses from
>0->1->2->3) and it's protected by the primary_crng->lock spinlock.
>There are a few places where we read it without first taking the lock,
>but that's where we depend on it getting incremented and where if we
>race with another CPU which has just bumped the crng_init status, it's
>not critical.  (Or we can take the lock and then recheck the crng_init
>if we really need to be sure.)
>
>> > +static void _initialize_crng(struct crng_state *crng)
>> > +{
>> > +  int i;
>> > +  unsigned long   rv;
>> 
>> Why do you use unsigned long here? I thought the state[i] is unsigned int.
>
>Because it gets filled in via arch_get_random_seed_long(&rv) and
>arch_get_random_long(&rv).  If that means we get 64 bits and we only
>use 32 bits, that's OK.
>
>> Would it make sense to add the ChaCha20 self test vectors from RFC7539
>> here to test that the ChaCha20 works?
>
>We're not doing that for any of the other ciphers and hashes.  We
>didn't do that for SHA1, and the chacha20 code where I took this from
>didn't check for this as well.  What threat are you most worried
>about?  We don't care about chacha20 being exactly chacha20, so long
>as it's cryptographically strong.  In fact I think I removed a
>potential host order byteswap in the set key operation specifically
>because we don't care about interop.
>
>If this is required for FIPS mode, we can add that later.  I was
>primarily trying to keep the patch small to be easier for people to
>look at it, so I've deliberately left off bells and whistles that
>aren't strictly needed to show that the new approach is sound.
>
>> > +  if (crng_init++ >= 2)
>> > +  wake_up_interruptible(&crng_init_wait);
>> 
>> Don't we have a race here with the crng_init < 3 check in crng_reseed
>> considering multi-core systems?
>
>No, because we are holding the primary_crng->lock spinlock.  I'll add
>a comment explaining the locking protections which is intended for
>crng_init where we declare it.
>
>
>> > +  if (num < 16 || num > 32) {
>> > +  WARN_ON(1);
>> > +  pr_err("crng_reseed: num is %d?!?\n", num);
>> > +  }
>> > +  num_words = (num + 3) / 4;
>> > +  for (i = 0; i < num_words; i++)
>> > +  primary_crng.state[i+4] ^= tmp[i];
>> > +  primary_crng.init_time = jiffies;
>> > +  if (crng_init < 3) {
>> 
>> Shouldn't that one be if (crng_init < 3 && num >= 16) ?
>
>I'll just change the above WARN_ON test to be:
>
> BUG_ON(num < 16 || num > 32);
>
>This really can't happen, and I had it as a WARN_ON with a printk for
>debugging purposes in case I was wrong about how the code works.
>
>> > +  crng_init = 3;
>> > +  process_random_ready_list();
>> > +  wake_up_interruptible(&crng_init_wait);
>> > +  pr_notice("random: crng_init 3\n");
>> 
>> Would it make sense to be more descriptive here to allow readers of
>> dmesg to understand the output?
>
>Yes, what we're currently printing during the initialization:
>
>random: crng_init 1
>random: crng_init 2
>random: crng_init 3
>
>was again mostly for debugging purposes.  What I'm thinking about
>doing is changing crng_init 2 and 3 messages to be:
>
>random: crng fast init done
>random: crng conservative init done
>
>> > +  }
>> > +  ret = 1;
>> > +out:
>> > +  spin_unlock_irqrestore(&primary_crng.lock, flags);
>> 
>> memzero_explicit of tmp?
>
>Good point, I've added a call to memzero_explicit().
>
>> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
>> > +{
>> > +  unsigned long v, flags;
>> > +  struct crng_state *crng = &primary_crng;
>> > +
>> > +  if (crng_init > 2 &&
>> > +  time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
>> > +  crng_reseed(&input_pool);
>> > +  spin_lock_irqsave(&crng->lock, flags);
>> > +  if (arch_get_random_long(&v))
>> > +  crng->state[14] ^= v;
>> > +  chacha20_block(&crng->state[0], out);
>> > +  if (crng->state[12] == 0)
>> > +  crng->state[13]++;
>
>> What is the purpose to only cover the 2nd 32-bit value of the nonce
>> with arch_get_random?
>> 
>> state[12]++? Or why do you increment the nonce?
>
>In Chacha20, its output is a function of the state array, where
>state[0..3] is a constant specified by the Chacha20 definition,
>state[4..11] is the Key, and state[12..15] is the IV.  The
>chacha20_block() function increments state[12] each time it is called,
>so state[12] is being used as the block counter.  The increment of
>state[13] is used to make state[13] to be the upper 32-bits of the
>block counter.  We *should* be reseeding more often than 2**32 calls
>to chacha20_block

Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
> > +static inline u32 rotl32(u32 v, u8 n)
> > +{
> > +   return (v << n) | (v >> (sizeof(v) * 8 - n));
> > +}
> 
> That's undefined behavior when n=0.

Sure, but it's never called with n = 0; I've double checked and the
compiler seems to do the right thing with the above pattern as well.

Hmm, it looks like there is a "standard" version of rotate left and right
defined in include/linux/bitops.h.  So I suspect it would make sense
to use rol32 as defined in bitops.h --- and this is probably something
that we should do for the rest of crypto/*.c, where people seem to be
defining their own version of something like rotl32 (I copied the
contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
of defining one's own version of rol32 isn't new).

> I think the portable way to do a rotate that avoids UB is the
> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
> instruction.
> 
> static const unsigned int MASK=31;
> return (v<<n)|(v>>(-n&MASK));
> 
> You should also avoid the following because it's not constant time due
> to the branch:
> 
> return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
> 

Where is this coming from?  I don't see this construct in the patch.

- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
> > +/*
> > + * crng_init =  0 --> Uninitialized
> > + * 2 --> Initialized
> > + * 3 --> Initialized from input_pool
> > + */
> > +static int crng_init = 0;
> 
> shouldn't that be an atomic_t ?

The crng_init variable only gets incremented (it progresses from
0->1->2->3) and it's protected by the primary_crng->lock spinlock.
There are a few places where we read it without first taking the lock,
but that's where we depend on it getting incremented and where if we
race with another CPU which has just bumped the crng_init status, it's
not critical.  (Or we can take the lock and then recheck the crng_init
if we really need to be sure.)

> > +static void _initialize_crng(struct crng_state *crng)
> > +{
> > +   int i;
> > +   unsigned long   rv;
> 
> Why do you use unsigned long here? I thought the state[i] is unsigned int.

Because it gets filled in via arch_get_random_seed_long(&rv) and
arch_get_random_long(&rv).  If that means we get 64 bits and we only
use 32 bits, that's OK.

> Would it make sense to add the ChaCha20 self test vectors from RFC7539
> here to test that the ChaCha20 works?

We're not doing that for any of the other ciphers and hashes.  We
didn't do that for SHA1, and the chacha20 code where I took this from
didn't check for this as well.  What threat are you most worried
about?  We don't care about chacha20 being exactly chacha20, so long
as it's cryptographically strong.  In fact I think I removed a
potential host order byteswap in the set key operation specifically
because we don't care about interop.

If this is required for FIPS mode, we can add that later.  I was
primarily trying to keep the patch small to be easier for people to
look at it, so I've deliberately left off bells and whistles that
aren't strictly needed to show that the new approach is sound.

> > +   if (crng_init++ >= 2)
> > +   wake_up_interruptible(&crng_init_wait);
> 
> Don't we have a race here with the crng_init < 3 check in crng_reseed 
> considering multi-core systems?

No, because we are holding the primary_crng->lock spinlock.  I'll add
a comment explaining the locking protections which is intended for
crng_init where we declare it.


> > +   if (num < 16 || num > 32) {
> > +   WARN_ON(1);
> > +   pr_err("crng_reseed: num is %d?!?\n", num);
> > +   }
> > +   num_words = (num + 3) / 4;
> > +   for (i = 0; i < num_words; i++)
> > +   primary_crng.state[i+4] ^= tmp[i];
> > +   primary_crng.init_time = jiffies;
> > +   if (crng_init < 3) {
> 
> Shouldn't that one be if (crng_init < 3 && num >= 16) ?

I'll just change the above WARN_ON test to be:

 BUG_ON(num < 16 || num > 32);

This really can't happen, and I had it as a WARN_ON with a printk for
debugging purposes in case I was wrong about how the code works.

> > +   crng_init = 3;
> > +   process_random_ready_list();
> > +   wake_up_interruptible(&crng_init_wait);
> > +   pr_notice("random: crng_init 3\n");
> 
> Would it make sense to be more descriptive here to allow readers of dmesg to 
> understand the output?

Yes, what we're currently printing during the initialization:

random: crng_init 1
random: crng_init 2
random: crng_init 3

was again mostly for debugging purposes.  What I'm thinking about
doing is changing crng_init 2 and 3 messages to be:

random: crng fast init done
random: crng conservative init done

> > +   }
> > +   ret = 1;
> > +out:
> > +   spin_unlock_irqrestore(&primary_crng.lock, flags);
> 
> memzero_explicit of tmp?

Good point, I've added a call to memzero_explicit().

> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> > +{
> > +   unsigned long v, flags;
> > +   struct crng_state *crng = &primary_crng;
> > +
> > +   if (crng_init > 2 &&
> > +   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> > +   crng_reseed(&input_pool);
> > +   spin_lock_irqsave(&crng->lock, flags);
> > +   if (arch_get_random_long(&v))
> > +   crng->state[14] ^= v;
> > +   chacha20_block(&crng->state[0], out);
> > +   if (crng->state[12] == 0)
> > +   crng->state[13]++;

> What is the purpose to only cover the 2nd 32 bit value of the nonce with 
> arch_get_random?
> 
> state[12]++? Or why do you increment the nonce?

In Chacha20, its output is a function of the state array, where
state[0..3] is a constant specified by the Chacha20 definition,
state[4..11] is the Key, and state[12..15] is the IV.  The
chacha20_block() function increments state[12] each time it is called,
so state[12] is being used as the block counter.  The increment of
state[13] is used to make state[13] to be the upper 32-bits of the
block counter.  We *should* be reseeding more often than 2**32 calls
to chacha20_block(), and the increment is just to be safe in case
something goes wrong and we're not reseeding.

We're using crng[14] to contain the output of RDRAND, so this is
where

Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread Jeffrey Walton
>> + chacha20_block(&crng->state[0], out);
>> + if (crng->state[12] == 0)
>> + crng->state[13]++;
>
> state[12]++? Or why do you increment the nonce?

In Bernstein's Salsa and ChaCha, the counter is 64-bit. It appears
ChaCha-TLS uses a 32-bit counter, and the other 32-bits is given to
the nonce.

Maybe the first question to ask is, what ChaCha is the kernel
providing? If its ChaCha-TLS, then the carry does not make a lot of
sense.

If the generator is limiting the amount of material under a given set
of security parameters (key and nonce), then the generator will likely
re-key itself long before the 256-GB induced wrap. In this case, it
does not matter which ChaCha the kernel is providing and the carry is
superfluous.

Jeff


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread Jeffrey Walton
> +static inline u32 rotl32(u32 v, u8 n)
> +{
> +   return (v << n) | (v >> (sizeof(v) * 8 - n));
> +}

That's undefined behavior when n=0.

I think the portable way to do a rotate that avoids UB is the
following. GCC, Clang and ICC recognize the pattern, and emit a rotate
instruction.

static const unsigned int MASK=31;
return (v<<n)|(v>>(-n&MASK));

You should also avoid the following because it's not constant time due
to the branch:

return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));

Jeff


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-03 Thread Stephan Mueller
On Tuesday, 3 May 2016 at 11:36:12, Stephan Mueller wrote:

Hi Ted,

> > +
> > +static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
> > +{
> > +   ssize_t ret = 0, i;
> > +   __u8 tmp[CHACHA20_BLOCK_SIZE];
> > +   int large_request = (nbytes > 256);
> > +
> > +   while (nbytes) {
> > +   if (large_request && need_resched()) {
> > +   if (signal_pending(current)) {
> > +   if (ret == 0)
> > +   ret = -ERESTARTSYS;
> > +   break;
> > +   }
> > +   schedule();
> > +   }
> > +
> > +   extract_crng(tmp);
> > +   i = min_t(int, nbytes, CHACHA20_BLOCK_SIZE);
> > +   if (copy_to_user(buf, tmp, i)) {
> > +   ret = -EFAULT;
> > +   break;
> > +   }
> > +
> > +   nbytes -= i;
> > +   buf += i;
> > +   ret += i;
> > +   }
> > +
> > +   /* Wipe data just written to memory */
> > +   memzero_explicit(tmp, sizeof(tmp));
> 
> Would it make sense to add another chacha20_block() call here at the end?
> Note, the one thing about the SP800-90A DRBG I really like is the enhanced
> backward secrecy support which is implemented by "updating" the internal
> state (the key / state) used for one or more random number generation
> rounds after one request for random numbers is satisfied.
> 
> This means that even if the state becomes known or the subsequent caller
> manages to deduce the state of the RNG to some degree of confidence, he
> cannot backtrack the already generated random numbers.
> 
> I see that the ChaCha20 RNG implicitly updates its state while it operates.
> But for the last round of the RNG, there is no more shuffling of the
> internal state. As one round is 64 bytes in size and many callers just want
> 16 or 32 bytes (as seen during testing), a lot of callers trigger only one
> round of the RNG.

After doing some performance tests, I see that we reach a performance of north 
of 200 MB/s on my system (compare that to 12 MB/s for the SHA-1 version).

Thus, I would assume adding another call to chacha20_block should not hurt.

Ciao
Stephan


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-03 Thread Stephan Mueller
On Monday, 2 May 2016 at 02:26:51, Theodore Ts'o wrote:

Hi Theodore,

One more item.

> The CRNG is faster, and we don't pretend to track entropy usage in the
> CRNG any more.
> 
> Signed-off-by: Theodore Ts'o 
> ---
>  crypto/chacha20_generic.c |  61 +------
>  drivers/char/random.c     | 282 ++++++++++++++++++++++
>  include/crypto/chacha20.h |   1 +
>  lib/Makefile              |   2 +-
>  lib/chacha20.c            |  79 ++++++++
>  5 files changed, 294 insertions(+), 131 deletions(-)
>  create mode 100644 lib/chacha20.c
> 
> diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> index da9c899..1cab831 100644
> --- a/crypto/chacha20_generic.c
> +++ b/crypto/chacha20_generic.c
> @@ -15,72 +15,11 @@
>  #include 
>  #include 
> 
> -static inline u32 rotl32(u32 v, u8 n)
> -{
> - return (v << n) | (v >> (sizeof(v) * 8 - n));
> -}
> -
>  static inline u32 le32_to_cpuvp(const void *p)
>  {
>   return le32_to_cpup(p);
>  }
> 
> -static void chacha20_block(u32 *state, void *stream)
> -{
> - u32 x[16], *out = stream;
> - int i;
> -
> - for (i = 0; i < ARRAY_SIZE(x); i++)
> - x[i] = state[i];
> -
> - for (i = 0; i < 20; i += 2) {
> - x[0]  += x[4];x[12] = rotl32(x[12] ^ x[0],  16);
> - x[1]  += x[5];x[13] = rotl32(x[13] ^ x[1],  16);
> - x[2]  += x[6];x[14] = rotl32(x[14] ^ x[2],  16);
> - x[3]  += x[7];x[15] = rotl32(x[15] ^ x[3],  16);
> -
> - x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],  12);
> - x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],  12);
> - x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10], 12);
> - x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11], 12);
> -
> - x[0]  += x[4];x[12] = rotl32(x[12] ^ x[0],   8);
> - x[1]  += x[5];x[13] = rotl32(x[13] ^ x[1],   8);
> - x[2]  += x[6];x[14] = rotl32(x[14] ^ x[2],   8);
> - x[3]  += x[7];x[15] = rotl32(x[15] ^ x[3],   8);
> -
> - x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],   7);
> - x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],   7);
> - x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10],  7);
> - x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11],  7);
> -
> - x[0]  += x[5];x[15] = rotl32(x[15] ^ x[0],  16);
> - x[1]  += x[6];x[12] = rotl32(x[12] ^ x[1],  16);
> - x[2]  += x[7];x[13] = rotl32(x[13] ^ x[2],  16);
> - x[3]  += x[4];x[14] = rotl32(x[14] ^ x[3],  16);
> -
> - x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10], 12);
> - x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11], 12);
> - x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],  12);
> - x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],  12);
> -
> - x[0]  += x[5];x[15] = rotl32(x[15] ^ x[0],   8);
> - x[1]  += x[6];x[12] = rotl32(x[12] ^ x[1],   8);
> - x[2]  += x[7];x[13] = rotl32(x[13] ^ x[2],   8);
> - x[3]  += x[4];x[14] = rotl32(x[14] ^ x[3],   8);
> -
> - x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10],  7);
> - x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11],  7);
> - x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],   7);
> - x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],   7);
> - }
> -
> - for (i = 0; i < ARRAY_SIZE(x); i++)
> - out[i] = cpu_to_le32(x[i] + state[i]);
> -
> - state[12]++;
> -}
> -
>  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
>unsigned int bytes)
>  {
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b583e53..95f4451 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -260,6 +260,7 @@
>  #include 
>  #include 
>  #include 
> +#include <crypto/chacha20.h>
> 
>  #include 
>  #include 
> @@ -412,6 +413,15 @@ static struct fasync_struct *fasync;
>  static DEFINE_SPINLOCK(random_ready_list_lock);
>  static LIST_HEAD(random_ready_list);
> 
> +/*
> + * crng_init =  0 --> Uninitialized
> + *   2 --> Initialized
> + *   3 --> Initialized from input_pool
> + */
> +static int crng_init = 0;
> +#define crng_ready() (likely(crng_init >= 2))
> +static void process_random_ready_list(void);
> +
>  /**
>   *
>   * OS independent entropy store.   Here are the functions which handle
> @@ -441,10 +451,13 @@ struct entropy_store {
>   __u8 last_data[EXTRACT_SIZE];
>  };
> 
> +static ssize_t extract_entropy(struct entropy_store *r, void *buf,
> +size_t nbytes, int min, int rsvd);
> +
> +static int crng_reseed(struct entropy_store *r);
>  static void push_to_pool(struct work_struct *work);
>  static __u32 input_pool_data[INPUT_POOL_WORDS];
>  static __u32 blocking_pool_data[OUTPUT_POOL_

Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-03 Thread Stephan Mueller
On Monday, 2 May 2016 at 02:26:51, Theodore Ts'o wrote:

Hi Theodore,

> The CRNG is faster, and we don't pretend to track entropy usage in the
> CRNG any more.

In general, I have no concerns with this approach either. And thank you for 
addressing some of my concerns.

There are a few more concerns left open. I would suggest writing them up with 
a proposal on how to address them.

Some comments inline:
> 
> Signed-off-by: Theodore Ts'o 
> ---
>  crypto/chacha20_generic.c |  61 --
>  drivers/char/random.c | 282 ++
>  include/crypto/chacha20.h |   1 +
>  lib/Makefile  |   2 +-
>  lib/chacha20.c|  79 +
>  5 files changed, 294 insertions(+), 131 deletions(-)
>  create mode 100644 lib/chacha20.c
> 
> diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> index da9c899..1cab831 100644
> --- a/crypto/chacha20_generic.c
> +++ b/crypto/chacha20_generic.c
> @@ -15,72 +15,11 @@
>  #include 
>  #include 
> 
> -static inline u32 rotl32(u32 v, u8 n)
> -{
> - return (v << n) | (v >> (sizeof(v) * 8 - n));
> -}
> -
>  static inline u32 le32_to_cpuvp(const void *p)
>  {
>   return le32_to_cpup(p);
>  }
> 
> -static void chacha20_block(u32 *state, void *stream)
> -{
> - u32 x[16], *out = stream;
> - int i;
> -
> - for (i = 0; i < ARRAY_SIZE(x); i++)
> - x[i] = state[i];
> -
> - for (i = 0; i < 20; i += 2) {
> - x[0]  += x[4];x[12] = rotl32(x[12] ^ x[0],  16);
> - x[1]  += x[5];x[13] = rotl32(x[13] ^ x[1],  16);
> - x[2]  += x[6];x[14] = rotl32(x[14] ^ x[2],  16);
> - x[3]  += x[7];x[15] = rotl32(x[15] ^ x[3],  16);
> -
> - x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],  12);
> - x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],  12);
> - x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10], 12);
> - x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11], 12);
> -
> - x[0]  += x[4];x[12] = rotl32(x[12] ^ x[0],   8);
> - x[1]  += x[5];x[13] = rotl32(x[13] ^ x[1],   8);
> - x[2]  += x[6];x[14] = rotl32(x[14] ^ x[2],   8);
> - x[3]  += x[7];x[15] = rotl32(x[15] ^ x[3],   8);
> -
> - x[8]  += x[12];   x[4]  = rotl32(x[4]  ^ x[8],   7);
> - x[9]  += x[13];   x[5]  = rotl32(x[5]  ^ x[9],   7);
> - x[10] += x[14];   x[6]  = rotl32(x[6]  ^ x[10],  7);
> - x[11] += x[15];   x[7]  = rotl32(x[7]  ^ x[11],  7);
> -
> - x[0]  += x[5];x[15] = rotl32(x[15] ^ x[0],  16);
> - x[1]  += x[6];x[12] = rotl32(x[12] ^ x[1],  16);
> - x[2]  += x[7];x[13] = rotl32(x[13] ^ x[2],  16);
> - x[3]  += x[4];x[14] = rotl32(x[14] ^ x[3],  16);
> -
> - x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10], 12);
> - x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11], 12);
> - x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],  12);
> - x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],  12);
> -
> - x[0]  += x[5];x[15] = rotl32(x[15] ^ x[0],   8);
> - x[1]  += x[6];x[12] = rotl32(x[12] ^ x[1],   8);
> - x[2]  += x[7];x[13] = rotl32(x[13] ^ x[2],   8);
> - x[3]  += x[4];x[14] = rotl32(x[14] ^ x[3],   8);
> -
> - x[10] += x[15];   x[5]  = rotl32(x[5]  ^ x[10],  7);
> - x[11] += x[12];   x[6]  = rotl32(x[6]  ^ x[11],  7);
> - x[8]  += x[13];   x[7]  = rotl32(x[7]  ^ x[8],   7);
> - x[9]  += x[14];   x[4]  = rotl32(x[4]  ^ x[9],   7);
> - }
> -
> - for (i = 0; i < ARRAY_SIZE(x); i++)
> - out[i] = cpu_to_le32(x[i] + state[i]);
> -
> - state[12]++;
> -}
> -
>  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
>unsigned int bytes)
>  {
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b583e53..95f4451 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -260,6 +260,7 @@
>  #include 
>  #include 
>  #include 
> +#include <crypto/chacha20.h>
> 
>  #include 
>  #include 
> @@ -412,6 +413,15 @@ static struct fasync_struct *fasync;
>  static DEFINE_SPINLOCK(random_ready_list_lock);
>  static LIST_HEAD(random_ready_list);
> 
> +/*
> + * crng_init =  0 --> Uninitialized
> + *   2 --> Initialized
> + *   3 --> Initialized from input_pool
> + */
> +static int crng_init = 0;

Shouldn't that be an atomic_t?

> +#define crng_ready() (likely(crng_init >= 2))
> +static void process_random_ready_list(void);
> +
>  /**
>   *
>   * OS independent entropy store.   Here are the functions which handle
> @@ -441,10 +451,13 @@ struct entropy_store {
>   __u8 last_data[EXTRACT_SIZE];
>  };
> 
> +static ssize_t extract_entropy(struct entropy_store *r, void *buf,
>