On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
>
> This is a very positive work for the cases when system is used as a server
> or other workloads which may involve cryptographic activity.  However, you
> seem to assume that aggressive entropy collection is always preferable even
> if it has extra costs.
Can you give an example of this?  How exactly will the changes on this
branch measurably impact performance?  Can you demonstrate this with a
benchmark?

> Please make this tunable.  Also, please make this tunable *before* merge,
> otherwise this will just never happen.  Like cprng_fast/strong functions
> were never optimised to not use spin-lock (at least they no longer use
> IPL_HIGH, which should have been considered simply as a bug).

As I believe my benchmarks at the time amply demonstrated, the
cprng_strong function is considerably more efficient than the entropy
extraction code it replaced.  Can you please explain how this is an
example of my supposed disdain for performance?  Would you prefer the
system to be slower, as it was, when user processes ask for entropy,
rather than faster, as I demonstrably made it?

Unfortunately, the arc4random() implementation which I renamed to
cprng_fast (almost two years ago, one should note) had a severe
concurrency bug with real security implications.  I highlighted the use
of a spin lock to fix that code as an emergency measure, and at the
time I asked you whether you could propose any other quick, simple fix.
I do not recall that you proposed one.

Another developer claimed at the time to have a lockless, concurrent
replacement for the arc4random() code almost ready to commit.  I won't
call him or her out publicly, but we both know who it is, and if you're
going to pester someone about arc4random()/cprng_fast() I wish you'd
pick the right target -- not me.

Honestly, I'm sick of getting beaten up for other people's mistakes
with arc4random.  I think it's uncalled for, and I wish you'd stop.  I
did not even touch any cprng functions in the current batch of changes,
yet you are still complaining about them.  That doesn't seem right to
me.

> Few fragments which caught my eye while skimming through the diff..
> > #if defined(__HAVE_CPU_COUNTER)
> > -	if (cpu_hascounter())
> > -		return (cpu_counter32());
> > +	if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> > +		return (cpu_counter());
> > +	}
> > #endif
>
> ??

We provide no MI API for obtaining a counter value of any known size
except 32 bits, unfortunately.  The instrumentation I added while
developing these changes revealed that the delta entropy estimator was
terribly broken due to wraparound; changing the timestamps to 64 bits
is the fix.  The code already fell back to microtime() if it couldn't
get a counter, which is already the case on older platforms, so though
this may look odd, I think it is essentially a 64-bit version of
precisely what we did before.

It would be nice to be able to get at all the timecounters -- not only
the currently selected one -- directly, in a non-horrible,
non-abstraction-violating way.  That might better address this, as well
as your concerns about the clock-skew randomness source below.  I am
hopeful that the current discussion of clocks on tech-kern will lead to
generic functionality that allows comparing clocks as an entropy source
with fewer gyrations.

> > -	rnd_add_uint32(&skewsrc, rnd_counter());
> > -	callout_schedule(&skew_callout, hz);
> > +	rnd_add_uint64(&skewsrc, rnd_counter());
> > +	callout_schedule(&skew_callout, hz / 10);
>
> Do we really need to run 10 extra callouts per second to get an extra
> bit?

Are you seriously telling me that you are concerned that adding 9
callouts per second will measurably impact system performance?  I would
submit that the change I made in my previous work on rnd -- to which
you now seem to be retroactively objecting -- eliminating the constant
use of 1-tick callouts by the entropy addition code, almost certainly
more than makes up for any performance impact caused by adding 9
callouts per second here.

> I did not investigate, but we have plenty of already existing callouts.
> Is, or can, flipflop logic be used with them?
> Or deltas between *all* callouts in the system?

Do you really think that this kind of invasive change to the internals
of the callout mechanism would be cheaper?

> > +static void kprintf_rnd_get(size_t bytes, void *priv)
> > +{
> > +	if (mutex_tryenter(&kprintf_mtx)) {
> > +		if (kprnd_added) {
> > +			SHA512_Final(kprnd_accum, &kprnd_sha);
> > +			rnd_add_data(&rnd_printf_source,
> > +				     kprnd_accum, sizeof(kprnd_accum), 0);
> > +			kprnd_added = 0;
> > +			/* This, we must do, since we called _Final. */
> > +			SHA512_Init(&kprnd_sha);
> > +			/* This is optional but seems useful. */
> > +			SHA512_Update(&kprnd_sha, kprnd_accum,
> > +				      sizeof(kprnd_accum));
> > +		}
> > +		mutex_exit(&kprintf_mtx);
> > +	}
> > +}
>
> Pre-checking kprnd_added first rather than locking/unlocking mutex all
> the time would be a start.. but how about not performing SHA-512 while
> holding IPL_HIGH mutex?  Not only embedded systems, but perhaps our VAX
> or m68k users would also prefer to not have micro-freezes every second
> while printing to a console.

I am glad to move the check of kprnd_added before the mutex; thanks.

However, can you actually demonstrate that this purported effect is
perceptible to humans?  We're talking about hashing *bytes of output
per minute* in the long run on most systems, and perhaps a few hundred
bytes per second in extreme cases.  SHA-512 may be slow, but even on
ancient, slow CPUs it handles a few hundred bytes well within the human
perception threshold of several milliseconds.  I cannot perceive the
effect of this code running in unaccelerated QEMU on an old, slow
laptop, FWIW; I tested it that way just to be sure.  I'm sure it is
perceptible if the system is stuck in a tight loop issuing printfs, but
the system is *already* unusable in that case because it is stuck at
IPL_HIGH.

Note that, to ensure debugging is not impeded by this code, it does not
fire if we are printing from DDB.
> > +	kr = rnd_sources.lh_first;
> > +	start = rset->start;
> > +	while (kr != NULL && start > 1) {
> > +		kr = kr->list.le_next;
> > +		start--;
> > +	}
>
> LIST_FIRST(), LIST_NEXT()?

Per what I understand to be our guidelines, I try not to mix this kind
of stylistic cleanup with more substantive changes.  The older parts of
the rnd code still contain some very old constructs -- like this one,
which I moved.  I am glad to clean them up in another pass over the
code.

Thor