James> I think the consensus is that the functionality belongs in "sort". James> Beyond that things are a bit less clear. However, Paul put forward a James> proposed usage which adapts the current -k option (see James> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html). James> Nobody made any comments suggesting that that wasn't a good way of James> doing things.
I liked my proposal for co-opting "-s": http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html In other words, rather than having a "random" column, you'd have a "row number" column, which would turn into a random column under the hash operation. The documentation for '-s' would be changed to suggest that it was enabling a comparison based on input row number. James> Last time this was discussed on this list there was some talk about James> selecting permutations with a seed. This makes me uncomfortable since James> though doubtless it would be convenient. The problem is that the James> number of permutations of the records of most files is larger than James> 2^64, and so representing the seed and making use of it would be very James> tricky. I then had a look at Fascicle 2 of volume 4 of Knuth, but it James> concentrates on generating all permutations, and so wasn't directly James> much help. This seems to be a common point of confusion which I think needs to be cleared up. The mapping from random seed to sample space is rarely surjective - in fact, whenever you sample more than a few numbers from a random number generator, it isn't. It only matters that the number of possible seeds is much larger than the number of samplings we would ever choose to perform, and if not 2^64 then definitely 2^128 will be sufficient for all eternity. (Maybe some theoretical common ground is needed. A random variable is a function X from a measurable space with probability measure P to another measurable space such that X^{-1}(B) is measurable for every measurable B in the range of X. We can define a probability measure on the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X. Usually the domain of X is taken to be unspecified and uncountable, but since in practice we only ever sample a finite subset of it - and since we cannot calculate the inverse of X at arbitrary points - we can replace the domain with a suitably large finite set with discrete probability measure and still get good sample coverage, independently of the size of the range of X. This new domain is the seed domain.) Phil> Paul's statement: Phil> > It's not as easy to implement as it sounds, I'm afraid; even if you Phil> > can assume /dev/random or /dev/urandom. Phil> Phil> is a slight concern, but I imagine the problems are applicable to *any* Phil> shuffling utility. Is it that the app must guarantee all lines of a Phil> non-seekable stdin must have an equal chance of any sort order? See my comment to James above. I think one need not make this guarantee, since only a tiny fraction of possible sort orders will be able to be tried by the user. However, I think it would be true for my proposal (and the one that James suggested to Davis) if one took the random number generator or hash function to be variable or unspecified during analysis of the algorithm. The fact that we ship with a particular RNG or hash function (of all uncountable possibilities, some unimplementable) hard-coded is not a problem since its resources are never exhausted. Jim> It looks like there are some desirable features that can be Jim> provided only by a shuffle-enabled program that is key-aware. Jim> Key specification and the comparison code are already part of sort. Jim> Obviously, duplicating all of that in a separate program is not Jim> an option. I don't relish the idea of factoring out sort's line- Jim> and key-handling code either, but it might be feasible. Jim> Jim> However, I do like the idea of a new program that simply outputs Jim> a random permutation of its input records, and that does it well, Jim> and repeatably. The Unix tool philosophy certainly does encourage Jim> the `perform one task and do it well' approach. Since doing it Jim> well includes handling input larger than available virtual memory, Jim> this is not trivial -- and it is well suited to the coreutils, Jim> i.e., it's not easily implementable as a script. Jim> Jim> Initially, I was inclined to say that adding both the new program Jim> (no key support) and related functionality to sort was desirable. Jim> Thinking of the limits of robustness of such a new program, I Jim> realized that if the input is sufficiently large and not seekable Jim> (e.g., from a pipe), then the program will have to resort to writing Jim> temporary files, much as sort already does. More duplicated effort, Jim> determining how much memory to use (like sort's --buffer-size=SIZE Jim> option), managing the temporary files, ensuring that they're removed Jim> upon interrupt, etc. But maybe not prohibitive. The new program Jim> would also have to have an option like sort's -z, --zero-terminated Jim> option, and --temporary-directory=DIR, and --output=FILE. In effect, Jim> it would need all of sort's options that don't relate to sorting. I did have some sympathy for Davis' point that a specialized 'shuffle' can be implemented much more efficiently than one based on an algorithm tuned to mergesort... Although I don't see a problem with two utilities if someone wants to write the second one, and it is correct, I would tend to agree with you and prefer a key-aware version as a priority. However, I just realized that the algorithm James suggested to Davis is wrong for the same reason that the original random patch I responded to is wrong: http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html I'm sorry for not noticing this earlier. One possibility for an efficient random permutation of a large file is a program which scans the file once and records the location of each line in an index, then applies a uniformly distributed random permutation to this line index by stepping through it in order and swapping the current entry with an entry chosen at random from the set of later entries, and finally steps through the index again and dereferences each line entry and prints out the corresponding line in the original file. Note that there is a randomize-lines package in Debian by Arthur de Jong, I've cc'ed him. It's much faster than 'sort' on small files, but doesn't handle large files at all. Jim> So implementing a robust shuffle program, even one without key Jim> handling capabilities, would require much of the infrastructure Jim> already present in sort.c. Jim> Jim> It sure sounds like shuffle and sort should share a lot of code, Jim> one way or another, so why not have them share the line- and key- Jim> handling code, too? I won't rule out adding a new program, like Jim> shuffle, but I confess I'm less inclined now than when I started Jim> typing this message. Well, it would only be line-handling code, and whatever else, but no key-handling code since Davis' shuffle wouldn't be key-aware, unless you're talking about my proposal. And maybe common things could be put in a library, I don't know. I am against duplication but not against change. Frederik _______________________________________________ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils