On Fri, Jun 03, 2005 at 06:27:18AM +0000, Davis Houlton wrote:
> > One possibility for an efficient random permutation of a large file is
> > a program which scans the file once and records the location of each
> > line in an index, then applies a uniformly distributed random
> > permutation to this line index by stepping through it in order and
> > swapping the current entry with an entry chosen at random from the set
> > of later entries, and finally steps through the index again and
> > dereferences each line entry and prints out the corresponding line in
> > the original file.
>
> Interesting points, Frederik!  Glossing over some trivial special-case
> details and taking a few syntax liberties, a 'back-of-napkin' style
> shuffle algorithm supporting N-sized data sets might look like this...
>
> struct index
>     A structure that allows us to fseek and read a given segment of data
>     from the given file into memory:
>         file
>         start
>         len
>
> calc_lines(temp_file_num): Let's assume indexes are stored not in memory,
> but in temp files 0...X, and that each temp file holds at most INDEX_MAX
> index structures.  This function, given a temp file counter, calculates
> the number of indexes in all previous temp files.  I'm using it here
> primarily as a convention to simplify the pseudo-code below.
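Concretely, that index record and the calc_lines() convention might be
rendered in C along the following lines.  This is only an untested sketch:
the off_t offsets, the fixed INDEX_MAX, and dropping the per-record source
file member (there is a single input file) are assumptions layered on top
of the pseudo-code, not part of it.

    #include <sys/types.h>          /* off_t */

    #define INDEX_MAX 1000000       /* assumed: index records per temp file */

    struct index
    {
      off_t start;                  /* byte offset of the line in the input */
      off_t len;                    /* length of the line in bytes */
    };

    /* Number of index records held in temp files 0..temp_file_num, i.e.
       how many records precede the first record of file temp_file_num + 1.
       Every temp file except possibly the last is assumed to be full.  */
    static off_t
    calc_lines (long temp_file_num)
    {
      if (temp_file_num < 0)
        return 0;
      return (off_t) (temp_file_num + 1) * INDEX_MAX;
    }

With that convention, record number R lives in temp file R / INDEX_MAX at
byte offset sizeof (struct index) * (R - calc_lines (R / INDEX_MAX - 1)),
which is exactly what the loop below computes.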
I don't see why you wouldn't just use one file.

> random(data_size): Makes data_size/RAND_MAX calls to rand to make sure
> our random number generator covers the entire data set.
>
> Then the shuffle routine:
>
> while input_size >= 0
>     random_line = random(input_size)
>
>     // Figure out which temp files have our data
>     input_size_file  = input_size / INDEX_MAX
>     random_line_file = random_line / INDEX_MAX
>
>     // Figure out how many actual indexes we need to skip forward.
>     // For example, if we can store 10 indexes per temp file, and we
>     // want index 18, we'd only skip over 7 lines in file 1.
>     input_size_pos =
>         sizeof(struct index) *
>         (input_size - calc_lines(input_size_file - 1))
>     random_line_pos =
>         sizeof(struct index) *
>         (random_line - calc_lines(random_line_file - 1))
>
>     // Grab last index
>     open file input_size_file
>     seek forward input_size_pos
>     read in struct index last_line_index
>     close input_size_file
>
>     // Grab random line index, and replace it
>     // with the last index
>     open file random_line_file
>     seek forward random_line_pos
>     read in struct index random_line_index
>     seek backward sizeof(struct index)
>     write last_line_index
>     close random_line_file
>
>     // Write out random line
>     open file random_line_index.file
>     seek random_line_index.start
>     print (read in random_line_index.len bytes)
>     close file random_line_index.file
>
>     input_size--
>
> The primary advantage above is that only two passes are made: one to
> create the indexes, and one to randomize the output.  Even for very large
> files, I don't know what the end efficiency is (I suspect not much), but
> I guess every little bit helps.

I didn't check the code, but you're right that only two passes are
necessary.  It's almost certainly a lot more efficient than 'sort --random'
would be.

> Obviously, there are some gross simplifications... For example, most
> numbers would be composed of two longs; I'm assuming we'll have to deal
> with line counts greater than LONG_MAX, so each line position would be
> composed of one long for a page offset (for files > LONG_MAX in data-set
> size) and one for the actual offset (when on the current page).  That may
> be overkill, and if it is, using just longs would greatly simplify the
> implementation.  Even then, we can't handle infinitely large files--is
> that a true requirement?

See /usr/include/features.h.  If you define _LARGEFILE_SOURCE then you get
functions for manipulating files with 64-bit sizes and offsets.  64 bits is
definitely enough.

Frederik

-- 
http://ofb.net/~frederik/
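For reference, a minimal program exercising that large-file interface might
look like the sketch below.  The file name and offset are made up, and the
_FILE_OFFSET_BITS=64 define is an extra, related knob (it widens off_t to
64 bits on 32-bit hosts); only _LARGEFILE_SOURCE is mentioned above.

    /* Feature-test macros must come before the first #include.  */
    #define _LARGEFILE_SOURCE       /* declare fseeko and ftello */
    #define _FILE_OFFSET_BITS 64    /* make off_t 64 bits on 32-bit hosts */

    #include <stdio.h>
    #include <sys/types.h>

    int
    main (void)
    {
      FILE *fp = fopen ("big.dat", "rb");   /* "big.dat" is a placeholder */
      if (!fp)
        return 1;

      /* Seek well past the 2 GiB limit of a plain long offset.  */
      if (fseeko (fp, (off_t) 5 * 1024 * 1024 * 1024, SEEK_SET) != 0)
        {
          fclose (fp);
          return 1;
        }
      printf ("offset is now %lld\n", (long long) ftello (fp));
      fclose (fp);
      return 0;
    }

Defining the macros on the compiler command line instead of in the source
works just as well.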