> One possibility for an efficient random permutation of a large file
> is a program which scans the file once and records the location of
> each line in an index, then applies a uniformly distributed random
> permutation to this line index by stepping through it in order and
> swapping the current entry with an entry chosen at random from the
> set of later entries, and finally steps through the index again,
> dereferencing each line entry and printing out the corresponding
> line in the original file.
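Interesting points Frederik! What you describe is essentially a
Fisher-Yates shuffle applied to a table of line offsets. Just to make
the idea concrete, an in-memory C sketch might look something like
this (untested, back-of-napkin: no error handling, and it assumes the
whole offset index fits in memory):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        long *off;
        size_t n = 0, cap = 1024;
        long pos = 0;
        int c;

        if (argc != 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        off = malloc(cap * sizeof *off);

        /* Pass 1: record the offset of every line start. */
        off[n++] = 0;
        while ((c = getc(f)) != EOF) {
            pos++;
            if (c == '\n') {
                if (n == cap)
                    off = realloc(off, (cap *= 2) * sizeof *off);
                off[n++] = pos;
            }
        }
        if (n > 0 && off[n - 1] == pos)
            n--;                /* no line starts at end-of-file */

        /* Fisher-Yates: swap each entry with one chosen at random
           from the set of later entries.  (rand() is biased and
           limited to RAND_MAX lines -- see below.) */
        srand(time(NULL));
        for (size_t i = 0; i + 1 < n; i++) {
            size_t j = i + rand() % (n - i);
            long t = off[i]; off[i] = off[j]; off[j] = t;
        }

        /* Pass 2: print the lines in shuffled order. */
        for (size_t i = 0; i < n; i++) {
            fseek(f, off[i], SEEK_SET);
            while ((c = getc(f)) != EOF && c != '\n')
                putchar(c);
            putchar('\n');
        }

        free(off);
        fclose(f);
        return 0;
    }

Once the index no longer fits in memory, though, things get messier,
which is what the scheme below tries to address.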
Glossing over some trivial special-case details and taking a few
syntax liberties, a 'back-of-napkin' style shuffle algorithm
supporting N-sized data sets might look like the following.

First, some conventions:

    struct index:
        A structure that allows us to fseek() to and read a given
        segment of data from the given file into memory.
            file
            start
            len

    calc_lines(temp_file_num):
        Let's assume indexes are stored not in memory, but in temp
        files 0...X, where each temp file holds at most INDEX_MAX
        index structures.  Given a temp file counter, this function
        calculates the number of indexes in all previous temp files.
        I'm using it here primarily as a convention to simplify the
        pseudo-code below.

    random(data_size):
        Makes enough calls to rand() (roughly data_size/RAND_MAX of
        them) to make sure our random number generator covers the
        entire data set.

Then the shuffle routine:

    while input_size >= 0
        random_line = random(input_size)

        // Figure out which temp files have our data
        input_size_file  = input_size / INDEX_MAX
        random_line_file = random_line / INDEX_MAX

        // Figure out how many actual indexes we need to skip
        // forward.  For example, if we can store 10 indexes per
        // temp file and we want index 18 (counting from 1), we'd
        // only skip over 7 entries in file 1.
        input_size_pos  = sizeof(struct index) *
                          (input_size - calc_lines(input_size_file - 1))
        random_line_pos = sizeof(struct index) *
                          (random_line - calc_lines(random_line_file - 1))

        // Grab the last index
        open file input_size_file
        seek forward input_size_pos
        read in struct index last_line_index
        close input_size_file

        // Grab the random line's index, and replace it with the
        // last index
        open file random_line_file
        seek forward random_line_pos
        read in struct index random_line_index
        seek backward sizeof(struct index)
        write last_line_index
        close random_line_file

        // Write out the random line
        open file random_line_index.file
        seek to random_line_index.start
        print (read in random_line_index.len bytes)
        close file random_line_index.file

        input_size--

The primary advantage of the above is that only two passes are made:
one to create the indexes, and one to randomize the output. I don't
know what the end efficiency is for very large files (I suspect not
much), but I guess every little bit helps.

Obviously, there are some gross simplifications... for example, most
numbers would really be composed of two longs. I'm assuming we'll have
to deal with line counts greater than LONG_MAX, so each line position
would be composed of one long for a page offset (for files larger than
LONG_MAX in data-set size) and one for the actual offset within the
current page. That may be overkill, and if it is, using plain longs
would greatly simplify the implementation. Even then, we can't handle
infinitely large files--is that a true requirement?

Thanks,
Davis
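P.S. The random(data_size) helper above is the fiddly part. One rough
way to build a large random number out of multiple rand() calls might
be the following (illustrative only -- the name big_random is made up,
and the final modulo introduces a slight bias):

    #include <stdlib.h>

    /* Return a pseudo-random value in [0, n], accumulating bits
       from several rand() calls so that values above RAND_MAX
       are reachable.  Not perfectly uniform. */
    unsigned long long big_random(unsigned long long n)
    {
        unsigned long long r = 0, max = 0;

        while (max < n) {
            r   = r   * ((unsigned long long)RAND_MAX + 1) + rand();
            max = max * ((unsigned long long)RAND_MAX + 1) + RAND_MAX;
        }
        return r % (n + 1);
    }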