> One possibility for an efficient random permutation of a large file
> is a program which scans the file once and records the location of
> each line in an index, then applies a uniformly distributed random
> permutation to this line index by stepping through it in order and
> swapping the current entry with an entry chosen at random from the
> set of later entries, and finally steps through the index again,
> dereferencing each line entry and printing out the corresponding
> line in the original file.
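Interesting points Frederik! What you describe is essentially a
Fisher-Yates shuffle applied to a table of line offsets. Just to make
the idea concrete, an in-memory C sketch might look something like
this (untested, back-of-napkin: no error handling, and it assumes the
whole offset index fits in memory):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        long *off;
        size_t n = 0, cap = 1024;
        long pos = 0;
        int c;

        if (argc != 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        off = malloc(cap * sizeof *off);

        /* Pass 1: record the offset of every line start. */
        off[n++] = 0;
        while ((c = getc(f)) != EOF) {
            pos++;
            if (c == '\n') {
                if (n == cap)
                    off = realloc(off, (cap *= 2) * sizeof *off);
                off[n++] = pos;
            }
        }
        if (n > 0 && off[n - 1] == pos)
            n--;                /* no line starts at end-of-file */

        /* Fisher-Yates: swap each entry with one chosen at random
           from the set of later entries.  (rand() is biased and
           limited to RAND_MAX lines -- see below.) */
        srand(time(NULL));
        for (size_t i = 0; i + 1 < n; i++) {
            size_t j = i + rand() % (n - i);
            long t = off[i]; off[i] = off[j]; off[j] = t;
        }

        /* Pass 2: print the lines in shuffled order. */
        for (size_t i = 0; i < n; i++) {
            fseek(f, off[i], SEEK_SET);
            while ((c = getc(f)) != EOF && c != '\n')
                putchar(c);
            putchar('\n');
        }

        free(off);
        fclose(f);
        return 0;
    }

Once the index no longer fits in memory, though, things get messier,
which is what the scheme below tries to address.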
Glossing over some trivial special-case details and taking a few
syntax liberties, a 'back-of-napkin' style shuffle algorithm
supporting N-sized data sets might look like the following.

First, some conventions:

    struct index:
        A structure that allows us to fseek() to and read a given
        segment of data from the given file into memory.
            file
            start
            len

    calc_lines(temp_file_num):
        Let's assume indexes are stored not in memory, but in temp
        files 0...X, where each temp file holds at most INDEX_MAX
        index structures.  Given a temp file counter, this function
        calculates the number of indexes in all previous temp files.
        I'm using it here primarily as a convention to simplify the
        pseudo-code below.

    random(data_size):
        Makes enough calls to rand() (roughly data_size/RAND_MAX of
        them) to make sure our random number generator covers the
        entire data set.

Then the shuffle routine:

    while input_size >= 0
        random_line = random(input_size)

        // Figure out which temp files have our data
        input_size_file  = input_size / INDEX_MAX
        random_line_file = random_line / INDEX_MAX

        // Figure out how many actual indexes we need to skip
        // forward.  For example, if we can store 10 indexes per
        // temp file and we want index 18 (counting from 1), we'd
        // only skip over 7 entries in file 1.
        input_size_pos  = sizeof(struct index) *
                          (input_size - calc_lines(input_size_file - 1))
        random_line_pos = sizeof(struct index) *
                          (random_line - calc_lines(random_line_file - 1))

        // Grab the last index
        open file input_size_file
        seek forward input_size_pos
        read in struct index last_line_index
        close input_size_file

        // Grab the random line's index, and replace it with the
        // last index
        open file random_line_file
        seek forward random_line_pos
        read in struct index random_line_index
        seek backward sizeof(struct index)
        write last_line_index
        close random_line_file

        // Write out the random line
        open file random_line_index.file
        seek to random_line_index.start
        print (read in random_line_index.len bytes)
        close file random_line_index.file

        input_size--

The primary advantage of the above is that only two passes are made:
one to create the indexes, and one to randomize the output. I don't
know what the end efficiency is for very large files (I suspect not
much), but I guess every little bit helps.

Obviously, there are some gross simplifications... for example, most
numbers would really be composed of two longs. I'm assuming we'll have
to deal with line counts greater than LONG_MAX, so each line position
would be composed of one long for a page offset (for files larger than
LONG_MAX in data-set size) and one for the actual offset within the
current page. That may be overkill, and if it is, using plain longs
would greatly simplify the implementation. Even then, we can't handle
infinitely large files--is that a true requirement?

Thanks,
Davis
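P.S. The random(data_size) helper above is the fiddly part. One rough
way to build a large random number out of multiple rand() calls might
be the following (illustrative only -- the name big_random is made up,
and the final modulo introduces a slight bias):

    #include <stdlib.h>

    /* Return a pseudo-random value in [0, n], accumulating bits
       from several rand() calls so that values above RAND_MAX
       are reachable.  Not perfectly uniform. */
    unsigned long long big_random(unsigned long long n)
    {
        unsigned long long r = 0, max = 0;

        while (max < n) {
            r   = r   * ((unsigned long long)RAND_MAX + 1) + rand();
            max = max * ((unsigned long long)RAND_MAX + 1) + RAND_MAX;
        }
        return r % (n + 1);
    }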