Sorry if this message arrives more than once: I seem to be having a bit of trouble subscribing properly.
There seems to be some sloppy thinking regarding efficiency and uniform randomness.

Regarding uniform randomness, the infamous Oleg of comp.lang.{scheme,functional} writes:

    Furthermore, if we have a sequence of N elements and associate with
    each element a key -- a random number uniformly distributed within
    [0, M-1] (where N! > M >= N), we have the configurational space of
    size M^N (i.e., M^N ways to key the sequence). There are N! possible
    permutations of the sequence. Alas, for N > 2 and M < N!, M^N is not
    evenly divisible by N!. Therefore, certain permutations are bound to
    be a bit more likely than the others.

(See http://okmij.org/ftp/Haskell/perfect-shuffle.txt .)

I think this is a potential problem with some of the suggested shuffling methods. I don't know how to combine sorting with shuffling properly, so I won't comment on that now.

In the following, please note that I am not an experienced systems programmer, so some of my thoughts could be totally wrong. Regarding efficiency, there are several cases:

1. The file is small.

   In this case, the program will be fast unless someone does something
   super-stupid. The speed will probably be limited by the overhead of
   the system calls used to read and write the file and (if /dev/urandom
   is used) by the time needed to generate random numbers. Tweaking
   performance here is probably not worth the effort, and it certainly
   should not be done without careful profiling.

2. The file is large.

   In this case, some care is in order, but I don't think a fancy
   algorithm will do any better than a simple one. The speed will be
   limited by paging, so optimization should focus on preventing
   unnecessary page faults. In any case, the file must be read from
   start to finish to find the line breaks, and then the lines must be
   read out; it is important not to add a third pass through
   carelessness.
Since there is no way to know how many lines the file contains before it has been processed entirely, we will have to guess based on a reasonable line length and then correct the guess if necessary, probably using realloc. It would also be possible to write the line indices to a temporary file and then mmap that for use, but I suspect this will rarely be a win. Empty lines should be easy to detect, and since they are all identical, there is no point in storing an index for each of them; just count how many there are.

On systems with mmap and madvise, the file should be mapped, the OS advised of sequential use, the file scanned for line breaks, the line-break array shuffled, the OS advised of random use, and the lines then read out through a buffer. The best thing about this method is that it should work well for both large and small files.

If a file (probably a device) does not support seeking (and therefore presumably cannot be mapped either), then the entire file has to be read into memory or into a temporary file. The tricky part is that there is generally no way to know how many bytes will be coming, so buffer space must be allocated dynamically. Probably the easiest way to do this, though not necessarily the best, is to copy the input to a temporary file and then mmap the temporary file for readout; the nice thing is that we don't have to grow the buffer ourselves. The other easy possibility is realloc, which may or may not be hideously inefficient. The last possibility I can think of is an array of arrays, but that seems messy, especially with CR/LF newlines, and it also makes it harder to deal efficiently with a read that only partly fills a buffer.

Everything gets messier, of course, on a system without mmap. In particular, with large files it will be necessary to skip around the file with lseek and read. Not fun at all.
The last major problem is choosing the shuffling algorithm itself. For any reasonably sized file, a really simple shuffle will be best: just fill an array with the increasing sequence 0, 1, 2, ..., shuffle it with a Knuth shuffle, and output the lines in that order, through a buffer. If the file is ridiculously huge, it would seem good to find an algorithm that allowed small parts to be shuffled separately. I have not yet seen such an algorithm that actually works, but I will think about it some more.

David Feuer

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils