So, the prototype runs a little slower than I had expected. It currently uses MD5 hashes; I could also look into CRC or something faster (but less secure, for those concerned). Anyway, here is a sample:

$ time ./sort -R < /usr/share/dict/words > /dev/null
./sort -R < /usr/share/dict/words > /dev/null  1.67s user 0.01s system 99% cpu 1.684 total
$ time ./sort < /usr/share/dict/words > /dev/null
./sort < /usr/share/dict/words > /dev/null  0.45s user 0.01s system 99% cpu 0.462 total
$ time rl < /usr/share/dict/words > /dev/null
rl < /usr/share/dict/words > /dev/null  0.07s user 0.01s system 99% cpu 0.074 total
$ wc -l /usr/share/dict/words
96274 /usr/share/dict/words

'rl' is the executable from Arthur's randomize-lines.

Of course, this 'sort'-based shuffle has the advantage of being independent of input order

$ print -l g f e d c b a | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452 -
$ print -l a b c d e f g | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452 -

and being field-aware

$ print -l "a 1" "a 2" "b 2" "b 03" "c 1" "c 2" | ./sort -k 1,1R -k 2,2n
c 1
c 2
a 1
a 2
b 2
b 03

- whatever these are worth.
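
In case it helps to see the idea in code, the comparison behind -R is roughly along the lines of the sketch below. It is only an illustration, not the actual patch: a 64-bit FNV-1a hash stands in for MD5, everything is slurped into memory, and no random key is mixed in - which is why the ordering depends only on the line contents, the same thing the identical md5sum digests above show for the prototype.

/* hash-shuffle sketch: sort lines by a hash of their contents.
   FNV-1a is a stand-in for MD5; no error checking, everything in memory. */
#define _GNU_SOURCE            /* for getline */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>         /* ssize_t */

static uint64_t
fnv1a64 (const char *s, size_t n)
{
  uint64_t h = 0xcbf29ce484222325ULL;
  for (size_t i = 0; i < n; i++)
    {
      h ^= (unsigned char) s[i];
      h *= 0x100000001b3ULL;
    }
  return h;
}

struct keyed_line { uint64_t key; char *text; };

static int
compare_keys (const void *a, const void *b)
{
  const struct keyed_line *x = a, *y = b;
  if (x->key != y->key)
    return x->key < y->key ? -1 : 1;
  return strcmp (x->text, y->text);     /* arbitrary tie-break */
}

int
main (void)
{
  struct keyed_line *lines = NULL;
  size_t used = 0, alloc = 0;
  char *buf = NULL;
  size_t bufsize = 0;
  ssize_t len;

  while ((len = getline (&buf, &bufsize, stdin)) > 0)
    {
      if (buf[len - 1] == '\n')
        buf[--len] = '\0';
      if (used == alloc)
        lines = realloc (lines, (alloc = alloc ? 2 * alloc : 64) * sizeof *lines);
      lines[used].key = fnv1a64 (buf, len);   /* key is a pure function of the line */
      lines[used].text = strdup (buf);
      used++;
    }

  qsort (lines, used, sizeof *lines, compare_keys);
  for (size_t i = 0; i < used; i++)
    puts (lines[i].text);
  return 0;
}

Note that with a content-only key, duplicate input lines always end up adjacent in the output of this sketch - one of the ways a hash-keyed sort differs from a true uniform shuffle.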

I just wanted to make sure the slowness is acceptable before proceeding.

Frederik

On Thu, Jun 02, 2005 at 02:16:10PM -0700, Frederik Eaton wrote:
> James> I think the consensus is that the functionality belongs in "sort". Beyond that things are a bit less clear. However, Paul put forward a proposed usage which adapts the current -k option (see http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html). Nobody made any comments suggesting that that wasn't a good way of doing things.
>
> I liked my proposal for co-opting "-s":
>
> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html
>
> In other words, rather than having a "random" column, you'd have a "row number" column, which would turn into a random column under the hash operation. The documentation for '-s' would be changed to say that it enables a comparison based on input row number.
>
> James> Last time this was discussed on this list there was some talk about selecting permutations with a seed. This makes me uncomfortable, though doubtless it would be convenient. The problem is that the number of permutations of the records of most files is larger than 2^64, and so representing the seed and making use of it would be very tricky. I then had a look at Fascicle 2 of Volume 4 of Knuth, but it concentrates on generating all permutations, and so wasn't directly much help.
>
> This seems to be a common point of confusion which I think needs to be cleared up. The mapping from random seed to sample space is rarely surjective - in fact, whenever you sample more than a few numbers from a random number generator, it isn't. It only matters that the number of possible seeds is much larger than the number of samplings we would ever choose to perform, and if not 2^64 then definitely 2^128 will be sufficient for all eternity.
>
> (Maybe some theoretical common ground is needed. A random variable is a function X from a measurable space with probability measure P to another measurable space, such that X^{-1}(B) is measurable for every measurable B in the range of X. We can define a probability measure on the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.
>
> Usually the domain of X is taken to be unspecified and uncountable, but since in practice we only ever sample a finite subset of it - and since we cannot calculate the inverse of X at arbitrary points - we can replace the domain with a suitably large finite set with a discrete probability measure and still get good sample coverage, independently of the size of the range of X. This new domain is the seed domain.)
>
> Phil> Paul's statement:
> Phil> > It's not as easy to implement as it sounds, I'm afraid; even if you
> Phil> > can assume /dev/random or /dev/urandom.
> Phil>
> Phil> is a slight concern, but I imagine the problems are applicable to *any* shuffling utility. Is it that the app must guarantee that all lines of a non-seekable stdin have an equal chance of any sort order?
>
> See my comment to James above. I think one need not make this guarantee, since only a tiny fraction of the possible sort orders could ever be tried by the user. However, I think it would be true for my proposal (and the one that James suggested to Davis) if one took the random number generator or hash function to be variable or unspecified during analysis of the algorithm. The fact that we ship with a particular RNG or hash function (of all the uncountably many possibilities, some unimplementable) hard-coded is not a problem, since its resources are never exhausted.
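
To make the row-number idea a little more concrete: the sort key for a line would be a keyed hash of the line's input position, so a given seed always reproduces the same permutation, and the finite set of possible seeds plays the role of the seed domain above. The following is only a rough sketch, not a proposed implementation - splitmix64 stands in for whatever keyed hash would actually be used, and it permutes bare row numbers 0..n-1 rather than lines:

/* row-number-hash sketch: print a permutation of 0..n-1 by sorting the
   row numbers on a keyed hash.  splitmix64 is a stand-in for a real
   keyed hash; same seed, same permutation. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t seed;

static uint64_t
mix64 (uint64_t x)                  /* splitmix64 finalizer */
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

static int
compare_hashed_rows (const void *a, const void *b)
{
  uint64_t ka = mix64 (seed ^ *(const uint64_t *) a);
  uint64_t kb = mix64 (seed ^ *(const uint64_t *) b);
  return ka < kb ? -1 : ka > kb;
}

int
main (int argc, char **argv)
{
  size_t n = argc > 1 ? strtoul (argv[1], NULL, 10) : 10;
  seed = argc > 2 ? strtoull (argv[2], NULL, 10) : 1;

  uint64_t *rows = malloc (n * sizeof *rows);
  for (size_t i = 0; i < n; i++)
    rows[i] = i;                    /* the "row number" column */

  /* Sorting the rows by their hashed values shuffles them; a row's
     final position depends only on (seed, input position), not on the
     line's contents. */
  qsort (rows, n, sizeof *rows, compare_hashed_rows);

  for (size_t i = 0; i < n; i++)
    printf ("%llu\n", (unsigned long long) rows[i]);
  return 0;
}

Run twice with the same seed it prints the same permutation; change the seed and you get a different one, which is about all "repeatably" needs to mean here, and the 2^64 possible seeds of this toy version are the finite domain being substituted for the uncountable one.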

> Jim> It looks like there are some desirable features that can be provided only by a shuffle-enabled program that is key-aware. Key specification and the comparison code are already part of sort. Obviously, duplicating all of that in a separate program is not an option. I don't relish the idea of factoring out sort's line- and key-handling code either, but it might be feasible.
> Jim>
> Jim> However, I do like the idea of a new program that simply outputs a random permutation of its input records, and that does it well, and repeatably. The Unix tool philosophy certainly does encourage the `perform one task and do it well' approach. Since doing it well includes handling input larger than available virtual memory, this is not trivial -- and it is well suited to the coreutils, i.e., it's not easily implementable as a script.
> Jim>
> Jim> Initially, I was inclined to say that adding both the new program (no key support) and related functionality to sort was desirable. Thinking of the limits of robustness of such a new program, I realized that if the input is sufficiently large and not seekable (e.g., from a pipe), then the program will have to resort to writing temporary files, much as sort already does. More duplicated effort: determining how much memory to use (like sort's --buffer-size=SIZE option), managing the temporary files, ensuring that they're removed upon interrupt, etc. But maybe not prohibitive. The new program would also have to have an option like sort's -z, --zero-terminated option, and --temporary-directory=DIR, and --output=FILE. In effect, it would need all of sort's options that don't relate to sorting.
>
> I did have some sympathy for Davis' point that a specialized 'shuffle' can be implemented much more efficiently than one based on an algorithm tuned to mergesort... Although I don't see a problem with two utilities if someone wants to write the second one, and it is correct, I would tend to agree with you and prefer a key-aware version as a priority.
>
> However, I just realized that the algorithm James suggested to Davis is wrong, for the same reason that the original random patch I responded to is wrong:
>
> http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html
>
> I'm sorry for not noticing this earlier.
>
> One possibility for an efficient random permutation of a large file is a program which scans the file once and records the location of each line in an index, then applies a uniformly distributed random permutation to this line index by stepping through it in order and swapping the current entry with an entry chosen at random from among the current and later entries, and finally steps through the index again, dereferencing each entry and printing the corresponding line from the original file.
>
> Note that there is a randomize-lines package in Debian by Arthur de Jong; I've cc'ed him. It's much faster than 'sort' on small files, but doesn't handle large files at all.
>
> Jim> So implementing a robust shuffle program, even one without key handling capabilities, would require much of the infrastructure already present in sort.c.
> Jim>
> Jim> It sure sounds like shuffle and sort should share a lot of code, one way or another, so why not have them share the line- and key-handling code, too? I won't rule out adding a new program, like shuffle, but I confess I'm less inclined now than when I started typing this message.
>
> Well, it would only be line-handling code, and whatever else, but no key-handling code, since Davis' shuffle wouldn't be key-aware - unless you're talking about my proposal. And maybe common things could be put in a library, I don't know. I am against duplication, but not against change.
>
> Frederik
> -- http://ofb.net/~frederik/
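
P.S. To make the one-pass index + swap idea above concrete, here is a rough sketch of that approach for a seekable input file. It ignores everything Jim raises about robustness (non-seekable input, temporary files, memory limits), does no error checking, and uses rand() with a modulo - biased, and only a placeholder for a real generator:

/* shuffle-by-index sketch: index line offsets, Fisher-Yates the index,
   then print lines by seeking.  Requires a seekable file; no error checks. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main (int argc, char **argv)
{
  FILE *f = fopen (argc > 1 ? argv[1] : "/usr/share/dict/words", "r");
  if (!f)
    return 1;

  /* Pass 1: record the byte offset at which each line starts.  */
  long *off = NULL;
  size_t n = 0, alloc = 0;
  long pos = 0;
  int c, at_bol = 1;
  while ((c = getc (f)) != EOF)
    {
      if (at_bol)
        {
          if (n == alloc)
            off = realloc (off, (alloc = alloc ? 2 * alloc : 1024) * sizeof *off);
          off[n++] = pos;
        }
      at_bol = (c == '\n');
      pos++;
    }

  /* Permute the index: swap entry i with one chosen uniformly from
     entries i..n-1 (the current entry included).  rand()%k is biased
     and rand() itself is weak; both are placeholders.  */
  srand ((unsigned) time (NULL));
  for (size_t i = 0; i + 1 < n; i++)
    {
      size_t j = i + (size_t) rand () % (n - i);
      long tmp = off[i];
      off[i] = off[j];
      off[j] = tmp;
    }

  /* Pass 2: dereference the shuffled index and print each line.  */
  for (size_t i = 0; i < n; i++)
    {
      fseek (f, off[i], SEEK_SET);
      while ((c = getc (f)) != EOF && c != '\n')
        putchar (c);
      putchar ('\n');
    }

  fclose (f);
  return 0;
}

For input on a pipe the second pass obviously can't seek, which is where the temporary-file machinery Jim describes would come in.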