Philip Rowlands <[EMAIL PROTECTED]> writes:

> I'm still interested to read what Paul considers to be the
> difficulties of such an implementation?
Suppose you're randomizing an input file of 10 million lines, and suppose you want to approximate a "truly random" key by using a 128-bit random key for each input line. Then you'll need about 1.3 billion random bits. On my desktop, /dev/random generates only about 200 random bits per second, so it'll take me about 2.5 months to randomize the file.

If you're clever you can tune things so you need only about 0.22 billion random bits: this is because the log base 2 of (10 million factorial) is about 220 million (I used Stirling's approximation). But even with that optimization it'll still take me a couple of weeks to randomize the file.

One way to be clever is to use the Knuth shuffle (a.k.a. the Fisher-Yates shuffle). However, this doesn't easily generalize to an algorithm that will work efficiently if the input is too large to fit into main memory. There are other ways to be clever even for large files, but I haven't had time to think them through. And don't forget: even if we're clever, it still takes me a couple of weeks to randomize a 10-million-line file.

For more on this subject, please see:

http://en.wikipedia.org/wiki/Knuth_shuffle

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils