I'm thinking about getting enough random numbers for a server that
does a lot of crypto, e.g. an IPSEC gateway, a web server doing
SSL, perhaps a PKI server.

On a heavily loaded server running FreeS/WAN IPSEC

http://www.xs4all.nl/~freeswan

you might have something of the order of an IPSEC connection being
rekeyed on average once a second (e.g. 3600 connections with hourly
rekey) and needing about 50 bytes per rekey. Loads are bursty;
average might be < 100 bytes/sec, but peak would be far higher. 
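
Just to spell out that arithmetic (a back-of-envelope sketch using the figures above, nothing more):

```python
# Back-of-envelope check of the load estimate above.
connections = 3600        # simultaneous IPSEC connections
rekey_interval = 3600     # seconds between rekeys (hourly)
bytes_per_rekey = 50

rekeys_per_sec = connections / rekey_interval
avg_bytes_per_sec = rekeys_per_sec * bytes_per_rekey
print(rekeys_per_sec, avg_bytes_per_sec)   # 1.0 50.0 -- well under 100/sec
```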

Anyway, I suspect that something delivering a few K bytes/sec of
good randomness would handle nearly any application.

This matches the capacity of commercial hardware generators:

http://www.protego.se/random_hardware_en.htm   9500 bytes/sec
http://pikp28.uni-muenster.de/~hoppep/zzeng.html 5K bits/sec

The two standard ways to generate random numbers are to use some
hardware device that exhibits random behaviour (like an oddly biased
diode), as in the devices above, or to collect lots of somewhat
random data from mouse movements, interrupts, etc. e.g. PGP's
randomness collection or Linux /dev/random:

http://www.openpgp.net/random

In either case you need to do some hash-like things to compress
somewhat random inputs into highly random outputs, and some buffering
to cope with bursty demand. Details in RFC 1750.
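
A minimal sketch of that conditioning-plus-buffering idea. Assumptions are mine, not RFC 1750's: SHA-256 as the mixing hash (any strong hash would do) and an arbitrary 16-to-1 compression ratio standing in for a real entropy estimate:

```python
import hashlib
from collections import deque

class EntropyPool:
    """Sketch of RFC 1750-style conditioning: hash-compress somewhat
    random samples, buffer the output to ride out bursty demand."""

    def __init__(self, compression=16):
        self.chunk = 32 * compression    # raw bytes crunched per digest
        self.raw = bytearray()
        self.buffer = deque()            # conditioned output bytes

    def add_sample(self, data):
        """Feed in somewhat-random input (timings, bus samples, ...)."""
        self.raw += data
        while len(self.raw) >= self.chunk:
            block = bytes(self.raw[:self.chunk])
            del self.raw[:self.chunk]
            self.buffer.extend(hashlib.sha256(block).digest())

    def read(self, n):
        """Hand out up to n conditioned bytes."""
        n = min(n, len(self.buffer))
        return bytes(self.buffer.popleft() for _ in range(n))

pool = EntropyPool()
pool.add_sample(bytes(600))        # 600 raw bytes -> one 32-byte digest
print(len(pool.read(32)))          # 32
```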

Proposal:

Could we do a large part of this with a fairly simple chip, all
digital, without diodes etc.? A system bus typically has at least
32 data and 32 address lines plus a bunch of control signals:
perhaps 80 bits. Sample those at, say, 1 MHz (no need to grab every
cycle of a 100 MHz bus) and you get 10 Mbytes/sec of raw input.
Crunch that down to 10K highly random bytes per second?

Some bits have dreadful entropy of course: e.g. on systems with under
a gig of RAM, top two address lines are always 0. On any system,
bottom two are usually 0 for aligned access. All ASCII characters
have top bit 0, much numeric data is small integers, etc. Interrupt
lines should spend most of their time in the non-asserted state. 
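
The usual way to account for such biased bits is the Shannon entropy of a line asserted with probability p; the 0.01 figure below is a made-up illustration, not a measurement:

```python
import math

def bit_entropy(p):
    """Shannon entropy (in bits) of a line asserted with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bit_entropy(0.5))     # 1.0 -- a genuinely fair bit
print(bit_entropy(0.01))    # ~0.08 -- e.g. an interrupt line mostly idle
```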

I have a crunching mechanism to propose. Nyberg showed that perfect
s-boxes (all columns and all linear combinations of columns are bent
boolean functions) can be built whenever there are twice as many input
bits as output bits. Build a substitution-permutation network with
perfect s-boxes.
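
A bent function is easy to recognise from its Walsh spectrum: every coefficient has magnitude 2^(n/2). This little check only tests a single function (the classic 4-bit bent function x0*x1 XOR x2*x3); generating a full 8 by 4 perfect s-box, where every nonzero linear combination of the output columns must pass this test, is the separate construction task:

```python
def walsh(f, n):
    """Walsh spectrum of a boolean function f on n bits."""
    return [sum((-1) ** (f(x) ^ (bin(x & w).count('1') & 1))
                for x in range(2 ** n))
            for w in range(2 ** n)]

def is_bent(f, n):
    """f is bent iff every Walsh coefficient has magnitude 2^(n/2)."""
    return all(abs(c) == 2 ** (n // 2) for c in walsh(f, n))

# The classic bent function on 4 bits: f(x) = x0*x1 XOR x2*x3.
f = lambda x: ((x & 1) & (x >> 1 & 1)) ^ ((x >> 2 & 1) & (x >> 3 & 1))
print(is_bent(f, 4))                  # True
print(is_bent(lambda x: x & 1, 4))    # False: a linear function is not bent
```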

Details would depend on analysis of input entropy (how much crunching
is needed) and of target chip technology (what is practical), but it
might look something like this:

Use 8 by 4 s-boxes throughout since they don't use much chip real
estate and generating perfect s-boxes this size looks feasible. I'd
like to add some 12 by 6 ones, but am not sure it is practical.

Take your 80 input bits and split them up according to expected
entropy. Perhaps 8 are expected to have reasonable entropy; pass
them to the next stage immediately. 16 with fair entropy? Crunch
them to 8 and pass them on at the next clock. The remaining 56,
crunched twice, give 14 bits on the third clock. So one bus read
yields 30 bits spread across 3 clocks, going into the main cruncher.
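
Spelling out that split (the 8/16/56 grouping is the guess above, not measured data):

```python
# 80 bus bits grouped by expected entropy, as sketched above.
good, fair, poor = 8, 16, 56
assert good + fair + poor == 80

clock1 = good         # passed straight through
clock2 = fair // 2    # one 8->4 crunch: 16 -> 8
clock3 = poor // 4    # crunched twice: 56 -> 28 -> 14
print(clock1, clock2, clock3, clock1 + clock2 + clock3)   # 8 8 14 30
```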

Suppose our estimate is that those bits have 2 bits of entropy per
32-bit word. Then we need 16-to-1 compression in the next phase.
128->64 using 16 s-boxes, 64->32 using 8, 32->16 using 4, 16->8
using 2.
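
The ladder itself is easy to sketch. Note the s-box contents here are an arbitrary placeholder, NOT a Nyberg-perfect box; a real design would generate those separately:

```python
def crunch(bits, sbox):
    """One 2:1 layer: each 8-bit group -> 4 bits through an 8 by 4 s-box."""
    out = []
    for i in range(0, len(bits), 8):
        idx = int(''.join(map(str, bits[i:i + 8])), 2)
        out += [(sbox[idx] >> j) & 1 for j in (3, 2, 1, 0)]
    return out

# Placeholder s-box contents -- just a stand-in, not a perfect s-box.
sbox = [(i * 0x9E) % 16 for i in range(256)]

bits = [0, 1] * 64                    # 128 input bits
lookups = 0
while len(bits) > 8:                  # 128 -> 64 -> 32 -> 16 -> 8
    lookups += len(bits) // 8
    bits = crunch(bits, sbox)
print(len(bits), lookups)             # 8 30
```

The lookup counter confirms the 16 + 8 + 4 + 2 = 30 lookups per output byte.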

We've used 30 lookups in 8 by 4 s-boxes to generate a byte. Compare
to block ciphers. DES does 128 lookups in 6 by 4 s-boxes per block
encryption, or 16 per byte. CAST-128 or Blowfish do 64 lookups per
block in 8 by 32 s-boxes, or 8 lookups per byte. Their s-boxes are
8 times wider than ours, though, so they're getting 8 times as much
s-box info per lookup. Overall, they use about twice as much s-box
info per byte processed as we do.
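
That accounting, spelled out (counting "s-box info" as lookups times output width):

```python
# S-box "information" per output byte: lookups times output width.
ours = 30 * 4              # our scheme: 30 lookups in 8 by 4 boxes per byte
des  = (128 // 8) * 4      # DES: 16 lookups/byte in 6 by 4 boxes
cast = (64 // 8) * 32      # CAST-128/Blowfish: 8 lookups/byte, 8 by 32 boxes
print(ours, des, cast)     # 120 64 256
print(round(cast / ours, 1))   # 2.1 -- "about twice as much"
```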

We could match this by adding another compression round, or exceed
it with two or three. Given that we need roughly 1000-to-1 compression
overall (Mbytes/sec in, Kbytes/sec out) and the above only gives
16-to-1, we'd likely do this.

Our Nyberg s-boxes are stronger than those in DES, CAST or Blowfish.
Of those, only CAST has bent functions in the columns. Both DES and
CAST attempt to optimise s-boxes (Blowfish makes them large and
random, but not optimised) but neither can use perfect s-boxes
because those do not exist in 6*4 or 8*32 sizes, only at 2n*n.

Also, DES re-uses the same 8 s-boxes in every round, and CAST-128 and
Blowfish re-use their four bigger ones. Depending on chip real estate,
we'd try to use as many different s-boxes as possible. Methinks at a
minimum we'd need thirty-two 8 by 4 s-boxes (4K bytes), preferably a
few times that, with a few 12 by 6 or 12 by 8 (4K each) for good
measure.

Of course we permute between layers so the output bits from each
s-box in one round go to 4 different boxes in the next round. We
could also easily add feedback; take a few output bits somewhere
and use them as additional inputs to previous rounds.
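
One concrete wiring with that spreading property, as a sketch (the j*n_boxes + b mapping is just one choice among many):

```python
def permute(bits, n_boxes):
    """Between layers: output bit j of s-box b moves to position
    j*n_boxes + b.  Since n_boxes >= 8, a box's four output bits land
    in four different next-round boxes (each of which reads 8
    consecutive bits)."""
    out = [None] * len(bits)
    for b in range(n_boxes):
        for j in range(4):
            out[j * n_boxes + b] = bits[4 * b + j]
    return out

# Check the spreading property for a 16-box layer (64 bits wide).
n = 16
spread = [{(j * n + b) // 8 for j in range(4)} for b in range(n)]
print(all(len(s) == 4 for s in spread))   # True: 4 distinct destinations
```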

If there's room, I'd use a few 12 by 8 s-boxes where a 12 by 6 subset
is Nyberg-strong and provides the output bits. The other two columns
are bent and reasonably non-linear in relation to the first 6; they
provide feedback bits.

Comments?
