> Part of the problem -- as you'll see in the thread rooted at > <http://lists.gnu.org/archive/html/bug-coreutils/2004-09/msg00001.html> > -- is that it's a bit tricky to define exactly what a random sort is.
OK, that's a long thread. What is the question? E.g. a) how ties should be resolved The same way as before. I didn't think about numbers before reading the thread, but ... why not just canonicalize the key numerically if -n is specified? Using, say, sprintf("%f",...). You don't get exactly the same behavior, e.g. the final distinction between 3 and 3.0 disappears, but I think this is OK. b) what distribution over permutations Uniform. Obviously. (?) c) whether the sorting should depend on the order of the input keys or the keys themselves I want it to depend on the keys themselves. Assuming no duplicate keys, sort -R will produce the same output no matter what order the input. d) something else...? > > A better algorithm would be as follows: combine each key you want to > > randomize with some salt (your "random seed"), and compute its hash. > > That would be better, but it's hard for me to say whether it > accurately reflected what I wanted. I would want "sort --random" to > sort according to a random permutation of the correct sort order, with > ties retained. First, it sounds like you don't want to retain ties. > (I'm curious: why not?) Why don't you think I want to retain ties? If two keys are equal, and you take their hash, they are still equal. Maybe I don't understand what you mean by ties. > Second, it's not at all clear to me that the algorithm you describe > will actually result in a random permutation taken from a uniform > distribution of all permutations. It follows immediately from the fact that the output order in independent of the input order (since we do a full sort) and the contents of the keys (since we hash them). I.e. we can permute the input keys and we still have the same chance of getting a given permutation - restating, if X is a random permutation chosen by this algorithm on an input then P(X=M)=P(XA=M)\forall permutations A,M. >From this and the bijectivity of permutations it followss that P(X=M)=P(XA=M)\forall A=(substituting N^{-1}M for A)P(XN^{-1}M=M)=P(X=N) for any two permutations M,N, i.e. the distribution is uniform. > > I think a feature like this would be quite useful to all manner of > > users, from those involved in statistical investigations to those > > who want to randomize their mp3's without installing a separate > > piece of software. > > For the latter, almost anything will do; but what sort of statistical > investigations did you have in mind? That might affect what we mean > by "random sort". It's quite possible that there should be multiple > flavors of "random sort". Quite? I think there is really only one flavor of any general use. Perhaps you can suggest others. As for the nature of the investigations, well, anything for which one needs a random permutation, I suppose. Also, random sampling with sort -R | head, though somewhat inefficient, but convenient, and becoming more efficient when one needs a largish fraction (say >0.5%) of a large file. But neither of these needs anything other than a uniform distribution on permutations. So, if there is interest in implementing what I've described, let's try to spec out all the requirements. In particular, will the number canonicalization thing above be sufficient, is it important what the hash function is, etc. E.g. would something like md5 be too slow? Would something much simpler be insufficiently random? Regards, Frederik _______________________________________________ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils