Sorry if this message arrives more than once: I seem to be having a bit of trouble subscribing properly.
There seems to be some sloppy thinking regarding efficiency and uniform randomness.

Regarding uniform randomness, the infamous Oleg of comp.lang.{scheme,functional} writes:

    Furthermore, if we have a sequence of N elements and associate with
    each element a key -- a random number uniformly distributed within
    [0, M-1] (where N! > M >= N), we have the configurational space of
    size M^N (i.e., M^N ways to key the sequence). There are N! possible
    permutations of the sequence. Alas, for N > 2 and M < N!, M^N is not
    evenly divisible by N!. Therefore, certain permutations are bound to
    be a bit more likely than the others.

(See http://okmij.org/ftp/Haskell/perfect-shuffle.txt .)

I think this is a potential problem with some of the suggested shuffling methods. I don't know how to combine sorting with shuffling properly, so I won't comment on that now.

In the following, please note that I am not an experienced systems programmer, so some of my thoughts could be totally wrong. Regarding efficiency, there are several cases:

1. The file is small.

   In this case, the program will be fast unless someone does something
   super-stupid. The speed will probably be limited by the overhead of
   the system calls used to read and write the file and (if /dev/urandom
   is used) by the time needed to generate random numbers. Tweaking
   performance here is probably not worth the effort, and it certainly
   should not be done without careful profiling.

2. The file is large.

   In this case, some care is in order, but I don't think a fancy
   algorithm will do any better than a simple one. The speed will be
   limited by paging, so optimization should focus on preventing
   unnecessary page faults. In any case, the file must be read from
   start to finish to find the line breaks, and then the lines must be
   read out; it is important not to add a third pass through
   carelessness.
Since there is no way to know how many lines the file contains before it has been processed entirely, we will have to guess based on a reasonable line length and then correct the guess if necessary, probably using realloc. It would also be possible to write the line indices to a temporary file and then mmap that for use, but I suspect this will rarely be a win. Empty lines should be easy to detect, and since they are all identical, there is no point in storing an index for each of them; just count how many there are.

On systems with mmap and madvise, the file should be mapped, the OS advised of sequential use, the file scanned for line breaks, the line-break array shuffled, the OS advised of random use, and the lines then read out through a buffer. The best thing about this method is that it should work well for both large and small files.

If a file (probably a device) does not support seeking (and therefore presumably cannot be mapped either), then the entire file has to be read into memory or into a temporary file. The tricky part is that there is generally no way to know how many bytes will be coming, so buffer space must be allocated dynamically. Probably the easiest way to do this, though not necessarily the best, is to copy the input to a temporary file and then mmap the temporary file for readout; the nice thing is that we don't have to grow the buffer ourselves. The other easy possibility is realloc, which may or may not be hideously inefficient. The last possibility I can think of is an array of arrays, but that seems messy, especially with CR/LF newlines, and it also makes it harder to deal efficiently with a read that only partly fills a buffer.

Everything gets messier, of course, on a system without mmap. In particular, with large files it will be necessary to skip around the file with lseek and read. Not fun at all.
The last major problem is choosing the shuffling algorithm itself. For any reasonably sized file, a really simple shuffle will be best: just fill an array with the increasing sequence 0, 1, 2, ..., shuffle it with a Knuth shuffle, and output the lines in that order, through a buffer. If the file is ridiculously huge, it would seem good to find an algorithm that allowed small parts to be shuffled separately. I have not yet seen such an algorithm that actually works, but I will think about it some more.

David Feuer

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils