Re: packet timestamps (was Re: Changes to make /dev/*random better sooner)

2014-04-15 Thread David Young
On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> What I would like to do is to make per-packet timestamps (which you
> are doing already) more widely useful by moving where the timestamp is
> taken closer to the input interrupt which signals a packet delivery
> and carrying it in each packet's mbuf metadata.  This has a nice effect
> on the rare occasion that the packet is delivered to a socket with a
> consumer which cares about packet arrival times (like an NTP or IEEE 1588
> implementation), but it can also provide a new measure of the performance
> of the network code when making changes to it (how long did it take packets
> to get from the input interface to the socket, or from the input interface
> to the output interface?) which doesn't exist now.  In fact it would be
> nice to have it right now so the effect of the changes you are making
> could be measured instead of just speculating.  I was also imagining that
> the random number machinery would harvest timestamps from the mbufs,
> but maybe only after it is determined whether the timestamp is one a user
> is interested in, so it wouldn't use those.

FWIW, based on a suggestion by Dennis, in August I added a timestamp
capability to MBUFTRACE for my use at $DAYJOB.  MBUFTRACE thus enhanced
shows me how long it takes (cycles maximum, cycles average) for packets
to make ownership transitions: e.g., from IP queue to wm0 transmission
queue.  This has been useful for finding out where to concentrate my
optimization efforts, and it has helped to rule in or out hypotheses
about networking bugs.  Thanks, Dennis.

Here and there I have also fixed an MBUFTRACE bug, and I have made some
changes designed to reduce counters' cache footprint.  I call my variant
of MBUFTRACE, MBUFTRACE3.  I hope to feed MBUFTRACE3 back one of these
days.
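
To make the idea concrete, here is a rough sketch (mine alone, not the
actual MBUFTRACE/MBUFTRACE3 code) of the bookkeeping involved: at every
ownership change, charge the cycles the mbuf spent with its previous owner
to the (previous owner, new owner) pair, tracking a count, a running total
for the average, and the maximum.

    #include <stdint.h>

    struct mowner_transition {
            uint64_t count;         /* transitions observed */
            uint64_t cycles_total;  /* running sum, for the average */
            uint64_t cycles_max;    /* worst case seen */
    };

    static void
    mowner_transition_record(struct mowner_transition *tr,
        uint64_t stamp_prev, uint64_t stamp_now)
    {
            uint64_t cycles = stamp_now - stamp_prev;

            tr->count++;
            tr->cycles_total += cycles;
            if (cycles > tr->cycles_max)
                    tr->cycles_max = cycles;
    }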

Here is a sample of 'netstat -ssm' output when MBUFTRACE3 is operating
on a box with "background" levels of traffic.  There are two tables:
the first you will recognize, and the second is new:

                             small     ext   cluster
 unix          inuse             3       1         1
 arp hold      inuse             8       0         0
 wm8 rx ring   inuse          1024    1024      1024
 wm7 rx ring   inuse          1024    1024      1024
 wm6 rx ring   inuse          1024    1024      1024
 wm5 rx ring   inuse          1024    1024      1024
 wm4 rx ring   inuse          1024    1024      1024
 wm3 rx ring   inuse          1024    1024      1024
 wm2 rx ring   inuse          1024    1024      1024
 wm1 rx ring   inuse          1024    1024      1024
 wm0 rx ring   inuse          1024    1024      1024
 unknown data  inuse         34802   34802         0
 revoked       inuse       1273302   63033     22922
 microsecs/tr         max   # transitions   previous owner -> new owner
          617       2,068                   wm8 cpu5 rxq -> wm8 rx
          827         199                   udp rx -> revoked
      954,719      10,772                   defer arp_deferral -> revoked
       132682                               route -> revoked
           70   1,627,735         835,683   unix -> revoked
          241     487,685           3,977   ixg1 rx -> arp
          260         260               1   arp hold -> ixg0 tx
        1,410     456,389             772   ixg0 rx -> arp
       22,296   6,712,491           2,082   ixg1 tx -> revoked
      315,846   6,293,761             136   ixg0 tx -> revoked
      516,585     709,193              13   wm4 tx ring -> revoked

The table shows microseconds, but netstat reads the CPU frequency and
the average and maximum cycles per transition from the kernel and does
the arithmetic itself.  I'm using the CPU cycle counter to take all of
the timestamps; I'm not compensating for clock drift or anything.
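
The arithmetic is nothing fancier than this (a sketch, assuming the
kernel exports the counter frequency in Hz):

    #include <stdint.h>

    static double
    cycles_to_microseconds(uint64_t cycles, uint64_t counter_hz)
    {
            return (double)cycles * 1e6 / (double)counter_hz;
    }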

A more suitable display for this information than a table is *probably*
a directed graph with a vertex corresponding to each mbuf owner, and
an edge corresponding to each owner->owner transition.  Set the area
of each vertex in proportion to the mbufs in-use by the corresponding
owner, and set the width of each edge in proportion to the rate of
transitions.  Label each vertex with an mbuf-owner name.  Graphs for
normal/high-performance and abnormal/low-performance machines will have
distinct patterns, and the graph will help to illuminate bottlenecks.
If anyone is interested in programming this, let me know, and I will
describe in more detail what I have in mind.
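
To give a flavour of what I mean, here is a small sketch that emits
Graphviz DOT; the owner and edge structures are invented for illustration
and would really be filled in from the MBUFTRACE counters:

    #include <math.h>
    #include <stdio.h>

    struct owner { const char *name; unsigned inuse; };
    struct edge  { int from, to; double rate; };    /* transitions/second */

    /* Emit Graphviz DOT: node area tracks mbufs in use, edge width tracks
     * the transition rate between owners. */
    static void
    emit_dot(const struct owner *own, int nown, const struct edge *e, int nedge)
    {
            printf("digraph mbuf_owners {\n");
            for (int i = 0; i < nown; i++) {
                    /* scale the diameter with sqrt() so the *area* is
                     * proportional to the in-use count */
                    double dia = 0.3 + 0.1 * sqrt((double)own[i].inuse);
                    printf("  o%d [label=\"%s\", shape=circle, "
                        "width=%.2f, fixedsize=true];\n", i, own[i].name, dia);
            }
            for (int i = 0; i < nedge; i++)
                    printf("  o%d -> o%d [penwidth=%.2f];\n",
                        e[i].from, e[i].to, 1.0 + 0.1 * e[i].rate);
            printf("}\n");
    }

Feed its output through dot(1) and the bottleneck transitions should stand
out as thick edges between large nodes.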

Dave

-- 
David Young
dyo...@pobox.com                Urbana, IL    (217)

Re: Changes to make /dev/*random better sooner

2014-04-11 Thread David Laight
On Thu, Apr 10, 2014 at 04:14:46PM -0700, Dennis Ferguson wrote:
> 
> On 10 Apr, 2014, at 05:34 , Thor Lancelot Simon  wrote:
> 
> > On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> >> 
> >> I'd really like to understand what problem is fixed by this.  It
> >> seems to make the code more expensive (significantly so since it
> >> precludes using timestamps in their cheapest-to-obtain form) but
> >> I'm missing the benefit this cost pays for.
> > 
> > It's no more expensive to sample a 64-bit than a 32-bit cycle counter,
> > if you have both.  Where do we have access to only a 32-bit cycle
> > counter?  I admit that the problem exists in theory.  I am not so sure
> > at all that it exists in practice.
> 
> 32 bit ARM processors have a 32-bit CPU cycle counter, when they
> have one.  PowerPC processors have a 64-bit counter but the 32-bit
> instruction set provides no way to get an atomic sample of all 64
> bits.  It requires three "special" instructions followed by a check
> and a possible repeat of the three instructions to get a consistent
> sample, which makes that significantly less useful for accurate event
> timing than the single atomic instruction which obtains the low order
> 32 bits alone.  I know i386, and 32-bit sparc running on a 64-bit
> processor, can get atomic samples of 64 bits of cycle counter from
> the 32-bit instruction set but I think those are exceptions rather
> than rules.

For the purposes of obtaining entropy it doesn't matter if the high
and low parts don't match.
Is there likely to be interesting entropy in the high bits anyway?
Certainly not more than once.

Also, having read high, low, high and found that the two 'high'
values differ, take the latter high bits and zero the low bits.
The value returned occurred while the counter was being read,
so it is a valid return value.
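
Something like this, in other words (a sketch only; read_counter_hi()
and read_counter_lo() are placeholders for whatever the port provides):

    #include <stdint.h>

    extern uint32_t read_counter_hi(void);  /* placeholder accessors */
    extern uint32_t read_counter_lo(void);

    static uint64_t
    counter_sample_for_entropy(void)
    {
            uint32_t hi1, lo, hi2;

            hi1 = read_counter_hi();
            lo  = read_counter_lo();
            hi2 = read_counter_hi();

            if (hi1 != hi2) {
                    /* The low word wrapped while we were reading; a value
                     * with the newer high word and a zero low word did occur
                     * during the read, so it is a valid sample. */
                    return (uint64_t)hi2 << 32;
            }
            return ((uint64_t)hi1 << 32) | lo;
    }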

David

-- 
David Laight: da...@l8s.co.uk


Re: Changes to make /dev/*random better sooner

2014-04-11 Thread Taylor R Campbell
   Date: Tue, 8 Apr 2014 00:25:32 -0400
   From: Thor Lancelot Simon 

   Attached are the changes from the tls-earlyentropy branch, which tries
   to make the output of /dev/random less predictable -- particularly for
   an attacker outside the box -- earlier.

   I intend to merge these soon.  Comment would be much appreciated.

I haven't found time to take a close look at these, but my first three
quick reactions are:

1. Getting entropy into newly installed systems should be a priority
far higher than spending effort trying to estimate newly gathered
entropy, and perhaps ought to be discussed and merged separately.
(See, e.g., .)

2. I'm inclined to say entropy estimation is something we ought to do
*off-line* for every kind of source, and we ought to statically write
down an upper bound on the amount of entropy per sample from sources,
rather than trying to estimate entropy on-line, with the option of
letting the system administrator control it with rndctl (e.g., to say:
`I'm about to bang on the keyboard like a monkey, please take that as
1 bit per sample rather than 0 bits per sample').

3. If you really want to use lzf as a dynamic entropy estimator, we'll
need to move it into src/sys/external -- kernel sources aren't allowed
to rely on anything outside src/sys and src/common (and nothing new is
allowed in src/common, I believe).


Re: Changes to make /dev/*random better sooner

2014-04-10 Thread Dennis Ferguson

On 10 Apr, 2014, at 05:34 , Thor Lancelot Simon  wrote:

> On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
>> 
>> I'd really like to understand what problem is fixed by this.  It
>> seems to make the code more expensive (significantly so since it
>> precludes using timestamps in their cheapest-to-obtain form) but
>> I'm missing the benefit this cost pays for.
> 
> It's no more expensive to sample a 64-bit than a 32-bit cycle counter,
> if you have both.  Where do we have access to only a 32-bit cycle
> counter?  I admit that the problem exists in theory.  I am not so sure
> at all that it exists in practice.

32 bit ARM processors have a 32-bit CPU cycle counter, when they
have one.  PowerPC processors have a 64-bit counter but the 32-bit
instruction set provides no way to get an atomic sample of all 64
bits.  It requires three "special" instructions followed by a check
and a possible repeat of the three instructions to get a consistent
sample, which makes that significantly less useful for accurate event
timing than the single atomic instruction which obtains the low order
32 bits alone.  I know i386, and 32-bit sparc running on a 64-bit
processor, can get atomic samples of 64 bits of cycle counter from
the 32-bit instruction set but I think those are exceptions rather
than rules.
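
For reference, the usual 32-bit PowerPC sequence looks roughly like the
sketch below; the inline asm is illustrative, not lifted from any
particular port:

    #include <stdint.h>

    static uint64_t
    ppc32_read_timebase(void)
    {
            uint32_t hi, lo, tmp;

            do {
                    __asm volatile(
                        "mftbu %0\n\t"      /* upper 32 bits */
                        "mftb  %1\n\t"      /* lower 32 bits */
                        "mftbu %2"          /* upper again, to catch a carry */
                        : "=r"(hi), "=r"(lo), "=r"(tmp));
            } while (hi != tmp);            /* wrapped mid-read: sample again */

            return ((uint64_t)hi << 32) | lo;
    }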

I should make clear the reason I'm interested in this.  I would
like to add packet arrival timestamping to the system for purposes
other than random number generation.  This could be done while
ignoring the random number sampling, treating the two as ships in
the night, but beyond the annoyance of taking two packet arrival
timestamps when one would do, there is another problem: when both
are sampling the same counter (the one providing the system clock),
I will be taking timestamps which are strongly correlated with the
timestamps you are taking, and sometimes delivering mine directly
to users, while you are simultaneously assuming your timestamps
are somehow a bit difficult to guess.

This is unlikely to matter in practice, I guess, but it smells
very bad.  I hence think the packet timestamping implementations
need to be integrated so that values being told to users aren't
used as random number fodder (or, at least, aren't counted as
entropy).  This would, incidentally, eliminate the extra timestamps
too.  I'm hence interested in impediments to doing this well, and
the 64-bit thing is one of those.

> The problem that is fixed is multiple wraparound of the counter.  I
> saw this effect with many sources on many systems.  Some samples really
> are very low rate, but they nonetheless are worth collecting.
> The multiple-wrap problem is one of several problems with the delta
> estimator, which I am beginning to suspect is not really well suited
> for entropy estimation of timestamps (it does work well on sequences
> like disk block or memory page addresses).

I understand there may be a problem associated with counter
wraparound, but I still have no idea what it is.  I was hoping
you'd explain it, or at least describe what made you think it is
a problem.  I can't even guess; my reading of the actual code and
the changes you made to it for this comes up with bupkis.  I also
spent a while running 64-bit values (random, and samples taken a
small random number of LFSR cycles apart) and the same values
truncated to 32 bits through the function, and found the latter
increased the rate of 0 returns from the function from about none
to about 1 in 10^9 (which is in fact the predictable result).  I'm
at a loss as to why this would be important, so I'm just lost.
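
For what it's worth, the back-of-the-envelope number behind "the
predictable result", assuming the three 32-bit differentials are roughly
independent and uniform, is

    \Pr[\text{return } 0] \;\approx\; \Pr[\delta = 0] + \Pr[\delta_2 = 0] + \Pr[\delta_3 = 0]
                          \;\approx\; 3 \cdot 2^{-32} \;\approx\; 7 \times 10^{-10},

i.e. on the order of one zero return per 10^9 samples.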

> The delta estimator seems to almost _always_ attribute entropy to timestamps,
> on modern machines, so it is a good thing it attributes only 1 bit.  I am
> more than open to suggestions of a better entropy estimator for timestamps;
> I had some hopes for LZ but the data seem to show clearly that's not it.

I don't know.  Modern machines have extremely high speed counters and
those, sampled asynchronously, should always have some unguessable bits
at the low order end (and probably more than one).

> It is noteworthy though that this work is not done when the samples are
> enqueued; it is done in batches when they are dequeued and processed,
> and without locks held.  So though there is a cost, it may not be happening
> in the context in which you're worried about it.

No, the cost I'm worried about is the cost of calling nanouptime() for
timestamps in this context on the machines where you do this.  I can
give you a value which is significantly cheaper to obtain than with
nanouptime(), which is guaranteed to have all the entropy that the
call to nanouptime() would provide but which may only carry 32 bits
(or less).  I'm wondering why you couldn't use this instead, but the
answer to that would require me to understand what it is about counter
wrap that is a problem.

Dennis Ferguson



Re: Changes to make /dev/*random better sooner

2014-04-10 Thread Thor Lancelot Simon
On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> 
> I'd really like to understand what problem is fixed by this.  It
> seems to make the code more expensive (significantly so since it
> precludes using timestamps in their cheapest-to-obtain form) but
> I'm missing the benefit this cost pays for.

It's no more expensive to sample a 64-bit than a 32-bit cycle counter,
if you have both.  Where do we have access to only a 32-bit cycle
counter?  I admit that the problem exists in theory.  I am not so sure
at all that it exists in practice.

The problem that is fixed is multiple wraparound of the counter.  I
saw this effect with many sources on many systems.  Some samples really
are very low rate, but they nonetheless are worth collecting.
The multiple-wrap problem is one of several problems with the delta
estimator, which I am beginning to suspect is not really well suited
for entropy estimation of timestamps (it does work well on sequences
like disk block or memory page addresses).
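
To illustrate what multiple wraparound does to the arithmetic, here is a
toy example (plain userland C, not the rnd code): once the true interval
exceeds 2^32 ticks, the 32-bit delta is only the true delta modulo 2^32,
and can be arbitrarily small or zero no matter how long the real interval
was.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint64_t t0 = 0;
            uint64_t t1 = 3ULL << 32;       /* three full 32-bit wraps later */
            uint32_t delta32 = (uint32_t)t1 - (uint32_t)t0;

            /* prints: true delta 12884901888, 32-bit delta 0 */
            printf("true delta %llu, 32-bit delta %u\n",
                (unsigned long long)(t1 - t0), delta32);
            return 0;
    }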

The delta estimator seems to almost _always_ attribute entropy to timestamps,
on modern machines, so it is a good thing it attributes only 1 bit.  I am
more than open to suggestions of a better entropy estimator for timestamps;
I had some hopes for LZ but the data seem to show clearly that's not it.

It is noteworthy though that this work is not done when the samples are
enqueued; it is done in batches when they are dequeued and processed,
and without locks held.  So though there is a cost, it may not be happening
in the context in which you're worried about it.

Thor


Re: Changes to make /dev/*random better sooner

2014-04-10 Thread Dennis Ferguson

On 7 Apr, 2014, at 21:25 , Thor Lancelot Simon  wrote:
> 1) Avoid wraparound problems with delta estimator by making estimation
>   framework 64-bit.
> 
> 2) Adjust rnd_counter to always return a 64-bit value, accordingly.

Hi,

I'd really like to understand what problem is fixed by this.  It
seems to make the code more expensive (significantly so since it
precludes using timestamps in their cheapest-to-obtain form) but
I'm missing the benefit this cost pays for.

I think (but I could be wrong) that the core of the change involves
this

 /*
- * Use the timing of the event to estimate the entropy gathered.
+ * Use the timing/value of the event to estimate the entropy gathered.
  * If all the differentials (first, second, and third) are non-zero, return
  * non-zero.  If any of these are zero, return zero.
  */
-static inline u_int32_t
-rnd_estimate_entropy(krndsource_t *rs, u_int32_t t)
+static inline uint32_t
+rnd_delta_estimate(rnd_delta_t *d, uint64_t v, int64_t delta)
 {
-   int32_t delta, delta2, delta3;
-
-   /*
-* If the time counter has overflowed, calculate the real difference.
-* If it has not, it is simplier.
-*/
-   if (t < rs->last_time)
-   delta = UINT_MAX - rs->last_time + t;
-   else
-   delta = rs->last_time - t;
+   int64_t delta2, delta3;
 
-   if (delta < 0)
-   delta = -delta;
+   d->insamples++;
 
/*
 * Calculate the second and third order differentials
 */
-   delta2 = rs->last_delta - delta;
+   delta2 = d->dx - delta;
if (delta2 < 0)
delta2 = -delta2;
 
-   delta3 = rs->last_delta2 - delta2;
+   delta3 = d->d2x - delta2;
if (delta3 < 0)
delta3 = -delta3;
 
-   rs->last_time = t;
-   rs->last_delta = delta;
-   rs->last_delta2 = delta2;
+   d->x = v;
+   d->dx = delta;
+   d->d2x = delta2;
 
	/*
	 * If any delta is 0, we got no entropy.  If all are non-zero, we
	 * might have something.
	 */
	if (delta == 0 || delta2 == 0 || delta3 == 0)
		return (0);
 
+   d->outbits++;
return (1);
 }

but I'm not seeing any problem with wrap around which is being
fixed.  If you have an unsigned 32 bit counter, and sample tc0
from the counter earlier and tc1 later, then the unsigned 32 bit
number of counter ticks which have elapsed between tc0 and tc1
will almost always be

uint32_t tc0, tc1, delta_tc;
[...]
delta_tc = tc1 - tc0;

It needs no more code than that.  The only way that will fail to
be true is if the time between the samples is so long that the
counter wraps 32 bits, but I don't see a problem with that.  The
fastest hardware counter in a machine you can buy today will take
more than a second to wrap 32 bits, so if that happens frequently
you aren't getting much data from this source in any case.
Additionally, each datum you get from a counter like this, which is
sampled asynchronously at a very low rate compared to its period,
will be very entropy-rich, so the very worst thing that can happen
by truncating the difference to 32 bits seems to be that you
might grossly underestimate the entropy of the sample (and, with
a minuscule probability, maybe toss a sample that you should keep
because the wrap produced a delta_tc of 0), but since you are
always estimating the entropy at a constant 1 bit you are
essentially making a pessimal estimate all the time anyway.  This
change to 64 bit arithmetic will more than double the number of
instructions it takes to compute this on a 32 bit machine, but
all it seems to save here is a minuscule probability of tossing
a timestamp you should keep.  I feel like I've got to be missing
something.

That doesn't bother me, though.  I'm more concerned about the
implications of this change for stuff I would like to do, in
particular this consequence:

/*
- * Generate a 32-bit counter.  This should be more machine dependent,
- * using cycle counters and the like when possible.
+ * Generate a 64-bit counter.
  */
-static inline u_int32_t
+static inline uint64_t
 rnd_counter(void)
 {
-   struct timeval tv;
+   struct timespec ts;
+   uint64_t ret;
 
 #if defined(__HAVE_CPU_COUNTER)
-   if (cpu_hascounter())
-   return (cpu_counter32());
+   if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
+   return (cpu_counter());
+   }
 #endif
if (rnd_ready) {
-   microtime(&tv);
-   return (tv.tv_sec * 1000000 + tv.tv_usec);
+   nanouptime(&ts);
+   ret = ts.tv_sec;
+   ret *= (uint64_t)1000000000;
+   ret += ts.tv_nsec;
+   return ret;
}
/* when called from rnd_init, its too early to call microtime safely */
return (0);

In all cases where the hardware counter isn't 64 bits wide you
are instead acquiring a full, 64 bit time-of-day timestamp for
a computation which by its 

Re: Changes to make /dev/*random better sooner

2014-04-09 Thread Mindaugas Rasiukevicius
Thor Lancelot Simon  wrote:
> On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
> >
> > This is a very positive work for the cases when system is used as a
> > server or other workloads which may involve cryptographic activity.
> > However, you seem to assume that aggressive entropy collection is
> > always preferable even if it has extra costs.
> 
> Can you give an example of this?  How exactly will the changes on this
> branch measurably impact performance?  Can you demonstrate with a
> benchmark?
> 
> > Please make this tunable.  Also, please make this tunable *before*
> > merge, otherwise this will just never happen.  Like cprng_fast/strong
> > functions were never optimised to not use spin-lock (at least they no
> > longer use IPL_HIGH, which should have been considered simply as a bug).
> 
> As I believe my benchmarks at the time amply demonstrated, the
> cprng_strong function is considerably more efficient than the entropy
> extraction code it replaced.  Can you please explain how this is an
> example of my supposed disdain for performance?
> 
> Would you prefer the system to be slower, as it was, when user processes
> ask for entropy, rather than faster, as I demonstrably made it?
> <...>

Perhaps I missed it, but were the benchmark results published?  Also, how
about cprng_fast32/64()?  If the current mechanism is indeed faster, then
I am glad that you have made an improvement.

> > Few fragments which caught my eye while skimming through the diff..
> > 
> > >  #if defined(__HAVE_CPU_COUNTER)
> > > - if (cpu_hascounter())
> > > - return (cpu_counter32());
> > > + if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> > > + return (cpu_counter());
> > > + }
> > >  #endif
> > 
> > ??
> 
> We provide no MI API for obtaining a counter value of any known size
> except 32 bits, unfortunately.  The instrumentation I added <...>

Thanks for the explanation, but I was trying to point out the bug.

> > > - rnd_add_uint32(&skewsrc, rnd_counter());
> > > - callout_schedule(&skew_callout, hz);
> > > + rnd_add_uint64(&skewsrc, rnd_counter());
> > > + callout_schedule(&skew_callout, hz / 10);
> > 
> > Do we really need to run 10 extra callouts per second to get an extra
> > bit?
> 
> Are you seriously telling me that you are concerned that adding 9 callouts
> per second will measurably impact system performance?
> 
> I would submit that the change I made in my previous work on rnd, to which
> you now seem to be retroactively objecting, to eliminate the constant use
> of 1-tick callouts by the entropy addition code, almost certainly more
> than makes up for any performance impact caused by adding 9 callouts per
> second here.

It is a systemic problem.  Everybody thinks 10 extra callouts per second
are peanuts, or that an extra worker kthread does not matter, etc.  Where
does that leave us in sum?  There are better solutions, but the quick ad
hoc ones tend to win.  See below.

> 
> > I did not investigate, but we have plenty of already existing callouts.
> > Is, or can, flipflop logic be used with them?  Or deltas between *all*
> > callouts in the system?
> 
> Do you really think that this kind of invasive change to the internals
> of the callout mechanism will be cheaper?

Absolutely.  Because instead of spending the cycles in extra callout
processing and software interrupt scheduling (which probably cost more
than your skew function), it would rely on the fact that other subsystems
already schedule callouts, and you could merely perform your skew function.
Of course, it would have to be done properly and it would need an efficient
entropy collection mechanism.

> > > +static void kprintf_rnd_get(size_t bytes, void *priv)
> > > +{
> > > + if (mutex_tryenter(&kprintf_mtx)) {
> > > + if (kprnd_added) {
> > > + SHA512_Final(kprnd_accum, &kprnd_sha);
> > > + rnd_add_data(&rnd_printf_source,
> > > +  kprnd_accum, sizeof(kprnd_accum), 0);
> > > + kprnd_added = 0;
> > > + /* This, we must do, since we called _Final. */
> > > + SHA512_Init(&kprnd_sha);
> > > + /* This is optional but seems useful. */
> > > + SHA512_Update(&kprnd_sha, kprnd_accum,
> > > +   sizeof(kprnd_accum));
> > > + }
> > > + mutex_exit(&kprintf_mtx);
> > > + }
> > > +}
> > 
> > Pre-checking kprnd_added first rather than locking/unlocking mutex all
> > the time would be a start.. but how about not performing SHA-512 while
> > holding IPL_HIGH mutex?  Not only embedded systems, but perhaps our VAX
> > or m68k users would also prefer to not have micro-freezes every second
> > while printing to a console.
> 
> Glad to move the check to kprnd_added.  Thanks.
> 
> However, can you actually demonstrate that this purported effect is
> perceptible to humans?  We're talking about hashin

Re: Changes to make /dev/*random better sooner

2014-04-09 Thread Thor Lancelot Simon
On Wed, Apr 09, 2014 at 08:30:31AM +0200, Manuel Bouyer wrote:
> On Tue, Apr 08, 2014 at 10:33:32PM -0400, Thor Lancelot Simon wrote:
> > > > -   rnd_add_uint32(&skewsrc, rnd_counter());
> > > > -   callout_schedule(&skew_callout, hz);
> > > > +   rnd_add_uint64(&skewsrc, rnd_counter());
> > > > +   callout_schedule(&skew_callout, hz / 10);
> > > 
> > > Do we really need to run 10 extra callouts per second to get an extra bit?
> > 
> > Are you seriously telling me that you are concerned that adding 9 callouts
> > per second will measurably impact system performance?
> 
> The problem I see with this is with power, especially for embedded systems.
> When we go to a tickless kernel (which I still hope to see), this is
> going to wake up the CPU 10 times per second on an idle system.

When we have a tickless kernel, we can use the polled-entropy-source
functionality I added to make this run only when entropy is urgently
needed.  I'll be glad to make this change as soon as we have a tickless
kernel.



Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Manuel Bouyer
On Tue, Apr 08, 2014 at 10:33:32PM -0400, Thor Lancelot Simon wrote:
> > > - rnd_add_uint32(&skewsrc, rnd_counter());
> > > - callout_schedule(&skew_callout, hz);
> > > + rnd_add_uint64(&skewsrc, rnd_counter());
> > > + callout_schedule(&skew_callout, hz / 10);
> > 
> > Do we really need to run 10 extra callouts per second to get an extra bit?
> 
> Are you seriously telling me that you are concerned that adding 9 callouts
> per second will measurably impact system performance?

The problem I see with this is with power, especially for embedded systems.
When we go to a tickless kernel (which I still hope to see), this is
going to wake up the CPU 10 times per second on an idle system.
We should try to avoid that.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Thor Lancelot Simon
On Tue, Apr 08, 2014 at 09:45:47PM -0500, Dave Huang wrote:
> 
> I don't have any knowledge or opinion about that, but maybe the "??" 
> has to do with the sizeof? I think the parens are wrong; they're around 
> the entire equality comparison.

Hell.  Yes they are.  Thanks both of you for this catch.

FWIW, I did almost all of my testing with the counter code commented
out, since that should considerably slow things down, while trying to
find a performance impact -- which I could not manage to find.

Thor


Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Dave Huang

On Apr 8, 2014, at 21:33, Thor Lancelot Simon  wrote:

> On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
>> Few fragments which caught my eye while skimming through the diff..
>> 
>>> #if defined(__HAVE_CPU_COUNTER)
>>> -   if (cpu_hascounter())
>>> -   return (cpu_counter32());
>>> +   if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
>>> +   return (cpu_counter());
>>> +   }
>>> #endif
>> 
>> ??
> 
> We provide no MI API for obtaining a counter value of any known size except
> 32 bits, unfortunately.  The instrumentation I added while developing these
> changes revealed that the delta entropy estimator was terribly broken due
> to wraparound; changing it to 64 bits is the fix.

I don't have any knowledge or opinion about that, but maybe the "??" 
has to do with the sizeof? I think the parens are wrong; they're around 
the entire equality comparison.
-- 
Name: Dave Huang |  Mammal, mammal / their names are called /
INet: k...@azeotrope.org |  they raise a paw / the bat, the cat /
FurryMUCK: Dahan |  dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 38 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++



Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Thor Lancelot Simon
On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
>
> This is a very positive work for the cases when system is used as a server
> or other workloads which may involve cryptographic activity.  However, you
> seem to assume that aggressive entropy collection is always preferable even
> if it has extra costs.

Can you give an example of this?  How exactly will the changes on this
branch measurably impact performance?  Can you demonstrate with a benchmark?

> Please make this tunable.  Also, please make this tunable *before* merge,
> otherwise this will just never happen.  Like cprng_fast/strong functions
> were never optimised to not use spin-lock (at least they no longer use
> IPL_HIGH, which should have been considered simply as a bug).

As I believe my benchmarks at the time amply demonstrated, the cprng_strong
function is considerably more efficient than the entropy extraction code
it replaced.  Can you please explain how this is an example of my supposed
disdain for performance?

Would you prefer the system to be slower, as it was, when user processes
ask for entropy, rather than faster, as I demonstrably made it?

Unfortunately, the arc4random() implementation which I renamed to cprng_fast
(almost two years ago, one should note) had a severe concurrency bug which
had real security implications.  I highlighted the use of a spin lock to fix
that code as an emergency measure and at the time, I asked you if you could
propose any other quick, simple fix.  I do not recall that you proposed one.

Another developer claimed at the time to have a lockless, concurrent
replacement for the arc4random() code almost ready to commit.  I won't
call him or her out publically but we both know who it is and if you're
going to pester someone about arc4random()/cprng_fast() I wish you'd pick
the right target -- not me.

Honestly I'm sick of getting beat up for other people's mistakes with
arc4random.  I think it's uncalled for.  I wish you'd stop.  I did not
even touch any cprng functions in the current batch of changes and you
are still complaining about them.  Doesn't seem right to me.

> Few fragments which caught my eye while skimming through the diff..
> 
> >  #if defined(__HAVE_CPU_COUNTER)
> > -   if (cpu_hascounter())
> > -   return (cpu_counter32());
> > +   if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> > +   return (cpu_counter());
> > +   }
> >  #endif
> 
> ??

We provide no MI API for obtaining a counter value of any known size except
32 bits, unfortunately.  The instrumentation I added while developing these
changes revealed that the delta entropy estimator was terribly broken due
to wraparound; changing it to 64 bits is the fix.

The code already fell back to microtime() if it couldn't get a counter,
which is already the case on older platforms, so though this may look
odd, I think it is essentially a 64-bit version of precisely what we did
before.

It would be nice to be able to get at all timecounters -- even ones
that are not currently selected -- directly in a non-horrible,
non-abstraction-violating way.  That might better address this, as
well as your concerns about the clock-skew randomness source below.
I am hopeful that the current discussion of clocks on tech-kern will
lead to generic functionality that allows comparison of clocks as an
entropy source with fewer gyrations.

> > -   rnd_add_uint32(&skewsrc, rnd_counter());
> > -   callout_schedule(&skew_callout, hz);
> > +   rnd_add_uint64(&skewsrc, rnd_counter());
> > +   callout_schedule(&skew_callout, hz / 10);
> 
> Do we really need to run 10 extra callouts per second to get an extra bit?

Are you seriously telling me that you are concerned that adding 9 callouts
per second will measurably impact system performance?

I would submit that the change I made in my previous work on rnd, to which
you now seem to be retroactively objecting, to eliminate the constant use
of 1-tick callouts by the entropy addition code, almost certainly more than
makes up for any performance impact caused by adding 9 callouts per second
here.

> I did not investigate, but we have plenty of already existing callouts.
> Is, or can, flipflop logic be used with them?  Or deltas between *all*
> callouts in the system?

Do you really think that this kind of invasive change to the internals
of the callout mechanism will be cheaper?

> > +static void kprintf_rnd_get(size_t bytes, void *priv)
> > +{
> > +   if (mutex_tryenter(&kprintf_mtx)) {
> > +   if (kprnd_added) {
> > +   SHA512_Final(kprnd_accum, &kprnd_sha);
> > +   rnd_add_data(&rnd_printf_source,
> > +kprnd_accum, sizeof(kprnd_accum), 0);
> > +   kprnd_added = 0;
> > +   /* This, we must do, since we called _Final. */
> > +   SHA512_Init(&kprnd_sha);
> > +   /* This is optional but seems useful. */
> >

Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Mindaugas Rasiukevicius
Thor Lancelot Simon  wrote:
> Attached are the changes from the tls-earlyentropy branch, which tries
> to make the output of /dev/random less predictable -- particularly for
> an attacker outside the box -- earlier.

This is a very positive work for the cases when system is used as a server
or other workloads which may involve cryptographic activity.  However, you
seem to assume that aggressive entropy collection is always preferable even
if it has extra costs.  This is simply not true.  There are small embedded
systems where aggressive entropy collection is not required and is actually
costly.  There are embedded systems where I happily run rlogin, because ssh
is simply not needed in that environment.

Please make this tunable.  Also, please make this tunable *before* merge,
otherwise this will just never happen.  Like cprng_fast/strong functions
were never optimised to not use spin-lock (at least they no longer use
IPL_HIGH, which should have been considered simply as a bug).

Few fragments which caught my eye while skimming through the diff..

>  #if defined(__HAVE_CPU_COUNTER)
> - if (cpu_hascounter())
> - return (cpu_counter32());
> + if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> + return (cpu_counter());
> + }
>  #endif

??

> - rnd_add_uint32(&skewsrc, rnd_counter());
> - callout_schedule(&skew_callout, hz);
> + rnd_add_uint64(&skewsrc, rnd_counter());
> + callout_schedule(&skew_callout, hz / 10);

Do we really need to run 10 extra callouts per second to get an extra bit?
I did not investigate, but we have plenty of already existing callouts.
Is, or can, flipflop logic be used with them?  Or deltas between *all*
callouts in the system?

> +static void kprintf_rnd_get(size_t bytes, void *priv)
> +{
> + if (mutex_tryenter(&kprintf_mtx)) {
> + if (kprnd_added) {
> + SHA512_Final(kprnd_accum, &kprnd_sha);
> + rnd_add_data(&rnd_printf_source,
> +  kprnd_accum, sizeof(kprnd_accum), 0);
> + kprnd_added = 0;
> + /* This, we must do, since we called _Final. */
> + SHA512_Init(&kprnd_sha);
> + /* This is optional but seems useful. */
> + SHA512_Update(&kprnd_sha, kprnd_accum,
> +   sizeof(kprnd_accum));
> + }
> + mutex_exit(&kprintf_mtx);
> + }
> +}

Pre-checking kprnd_added first rather than locking/unlocking mutex all
the time would be a start.. but how about not performing SHA-512 while
holding IPL_HIGH mutex?  Not only embedded systems, but perhaps our VAX
or m68k users would also prefer to not have micro-freezes every second
while printing to a console.
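
Concretely, the pre-check could look like this (a sketch only, reusing
the body quoted above):

    static void kprintf_rnd_get(size_t bytes, void *priv)
    {
            if (!kprnd_added)               /* cheap unlocked pre-check */
                    return;
            if (mutex_tryenter(&kprintf_mtx)) {
                    if (kprnd_added) {      /* re-check under the lock */
                            SHA512_Final(kprnd_accum, &kprnd_sha);
                            rnd_add_data(&rnd_printf_source,
                                 kprnd_accum, sizeof(kprnd_accum), 0);
                            kprnd_added = 0;
                            SHA512_Init(&kprnd_sha);
                            SHA512_Update(&kprnd_sha, kprnd_accum,
                                  sizeof(kprnd_accum));
                    }
                    mutex_exit(&kprintf_mtx);
            }
    }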

> + kr = rnd_sources.lh_first;
> + start = rset->start;
> + while (kr != NULL && start > 1) {
> + kr = kr->list.le_next;
> + start--;
> + }

LIST_FIRST(), LIST_NEXT()?
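
I.e. the same loop, just spelled with the queue(3) macros:

    kr = LIST_FIRST(&rnd_sources);
    start = rset->start;
    while (kr != NULL && start > 1) {
            kr = LIST_NEXT(kr, list);
            start--;
    }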

-- 
Mindaugas


Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Thor Lancelot Simon
On Tue, Apr 08, 2014 at 12:25:32AM -0400, Thor Lancelot Simon wrote:
>
> Attached are the changes from the tls-earlyentropy branch, which tries
> to make the output of /dev/random less predictable -- particularly for
> an attacker outside the box -- earlier.

I have observed a buglet though I haven't pinned down the cause yet.
On one of my systems, rndctl -l -v consistently displays the new
"printf" source twice.

Anyone else see it?

Thor


Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Thor Lancelot Simon
On Tue, Apr 08, 2014 at 09:39:12AM +0200, Martin Husemann wrote:
> On Tue, Apr 08, 2014 at 12:25:32AM -0400, Thor Lancelot Simon wrote:
> > 2) Accumulate the output of kernel printf (as well as the times
> >when it's called) and add this periodically.  To avoid issues
> >with recursion through diagnostic printfs, we use SHA512 to
> >accumulate the printf output, then mix in its output.
> 
> I wonder if we should make this part a compile-time option (typically
> defaulting to on for most architectures -- but I haven't really tried to
> gauge the difference it makes for really slow machines). I do understand
> that kernel output is rare post boot, so it might not be a big deal.

Even *at* boot time, kernel output is rare for the relevant definition
of "rare" -- even on the slowest machine, hashing a couple of kilobytes
will take only a tiny fraction of a second.

Thor


Re: Changes to make /dev/*random better sooner

2014-04-08 Thread Martin Husemann
On Tue, Apr 08, 2014 at 12:25:32AM -0400, Thor Lancelot Simon wrote:
> 2) Accumulate the output of kernel printf (as well as the times
>when it's called) and add this periodically.  To avoid issues
>with recursion through diagnostic printfs, we use SHA512 to
>accumulate the printf output, then mix in its output.

I wonder if we should make this part a compile-time option (typically
defaulting to on for most architectures -- but I haven't really tried to
gauge the difference it makes for really slow machines). I do understand
that kernel output is rare post boot, so it might not be a big deal.

Martin