On Mon, May 23, 2005 at 08:02:19PM +0000, Davis Houlton wrote:
> On Monday 23 May 2005 16:35, you wrote:
> > So, I think that "shuffle" is a good idea.
> Great! As I wasn't sure if this was a good idea or not, right now the
> functionality is quite minimal. I agree that it needs to be expanded, and
> will do what I can to do so.
Of course, I'm not one of the coreutils maintainers, so my opinion has
no weight.

> > Does it produce all possible outputs with equal probability?
> I would say so. The randomizing function is that old standby
>     return 1+(rand()%line_count);

You have a problem where a file has more than RAND_MAX lines. See below.

[...]

> redundancy--I will make those (right now I'm using a linked list to
> store the file in memory. This really should be a pointer of
> strings, of course). It's fast enough for me on 2GHz hardware, but
> might as well go the extra mile.

[...]

> I also realized I need to add a caveat section--this utility is
> memory bound (currently). If you have 4GB of file to randomize and
> only 1GB of memory, well, thems the bricks. The obvious way around
> this is to first scan and create an index of delimiter locations,
> then fseek through a random walk of said delimiters--but I wasn't
> sure if it was really necessary at this point--can't imagine it
> would be very fast, and then there's the issue of stdin. Probably
> need to add this as a --use-index option.

The GNU project likes to avoid arbitrary limits. Take a look at the
way that 'sort' handles the same situation.

Rather than doing that many random reads, it is probably more efficient
to do something like this:

1. Read N lines from the input file(s), where N is chosen such that the
   data fits into memory.

2. Shuffle each chunk of N lines, and write the shuffled data to a
   temporary file. Continue until all the input has been read and sent
   to a temporary file (if the input all fits into memory, skip to
   step 4).

3. Read the temporary files back, selecting the next line from a
   randomly selected temporary file until all the temporary files have
   been completely read.

4. On the final merge pass (if any), send the data to the output file
   instead of a temporary file. Delete all the temporary files (you
   need to do this manually, since the resource limit on the number of
   simultaneously open files prevents you using tmpfile()).

This is quite complex to implement, since you can only merge M
temporary files in a single merge pass, where M < RAND_MAX and
M < the RLIMIT_NOFILE resource limit.

Because of this complexity, you might want to get some feedback from
the coreutils maintainers before you consider making that change,
because it is quite a significant one.

Regards,
James Youngman.

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
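As a rough illustration of the RAND_MAX point above: 1+(rand()%line_count)
can never select a line beyond RAND_MAX, and the modulo also skews the
distribution slightly. One way around both problems is to build a wider
random value and reject the biased tail (an untested sketch; rand_uint64()
and random_index() are made-up helper names, and a real implementation
would probably want a better generator than rand()):

    #include <stdlib.h>
    #include <stdint.h>

    /* Combine several rand() calls into a 64-bit value.  The C standard
       only guarantees RAND_MAX >= 32767, so take 15 bits per call; the
       excess high bits simply fall off the top, leaving a uniform
       64-bit result (assuming rand() itself is uniform).  */
    static uint64_t
    rand_uint64 (void)
    {
      uint64_t r = 0;
      for (int i = 0; i < 5; i++)
        r = (r << 15) | (rand () & 0x7fff);
      return r;
    }

    /* Return a uniformly distributed value in [0, n), n > 0, using
       rejection sampling instead of a bare "% n" to avoid bias.  */
    static uint64_t
    random_index (uint64_t n)
    {
      uint64_t limit = UINT64_MAX - (UINT64_MAX % n);  /* largest multiple of n */
      uint64_t r;
      do
        r = rand_uint64 ();
      while (r >= limit);
      return r % n;
    }

With this, after seeding with srand(), the 1-based line number is just
1 + random_index (line_count), and line_count is no longer capped at
RAND_MAX.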
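And a compressed sketch of steps 2-4 above, to make the shape of the
chunked approach concrete (again untested; shuffle_lines() and
merge_random() are invented names, random_index() is the helper sketched
above, and the bookkeeping for creating, reopening and deleting the
temporary files is left out):

    #define _GNU_SOURCE          /* for getline() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    extern uint64_t random_index (uint64_t n);   /* uniform in [0, n) */

    /* Step 2: Fisher-Yates shuffle of one in-memory chunk, held as an
       array of pointers to lines, so every permutation of the chunk is
       equally likely.  */
    static void
    shuffle_lines (char **line, size_t n)
    {
      for (size_t i = n; i > 1; i--)
        {
          size_t j = random_index (i);           /* j in [0, i) */
          char *t = line[i - 1];
          line[i - 1] = line[j];
          line[j] = t;
        }
    }

    /* Steps 3 and 4: copy one line at a time from a randomly chosen
       temporary file until every file is exhausted.  On the final pass
       'out' is the real output stream rather than another temporary.  */
    static void
    merge_random (FILE **tmp, size_t ntmp, FILE *out)
    {
      char *buf = NULL;
      size_t bufsize = 0;

      while (ntmp > 0)
        {
          size_t i = random_index (ntmp);
          if (getline (&buf, &bufsize, tmp[i]) == -1)
            {
              fclose (tmp[i]);                   /* file i is used up */
              tmp[i] = tmp[--ntmp];
            }
          else
            fputs (buf, out);
        }
      free (buf);
    }

One caveat worth flagging: picking the source file uniformly, as described
in step 3, is only approximately fair; weighting the choice by the number
of lines each temporary file still has left would make the final ordering
an exactly uniform permutation.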