> Part of the problem -- as you'll see in the thread rooted at
> <http://lists.gnu.org/archive/html/bug-coreutils/2004-09/msg00001.html>
> -- is that it's a bit tricky to define exactly what a random sort is.

OK, that's a long thread. What is the question? E.g.

a) how ties should be resolved

The same way as before. I didn't think about numbers before reading
the thread, but ... why not just canonicalize the key numerically if
-n is specified? Using, say, sprintf("%f",...). You don't get exactly
the same behavior, e.g. the final distinction between 3 and 3.0
disappears, but I think this is OK.

b) what distribution over permutations

Uniform. Obviously. (?)

c) whether the sorting should depend on the order of the input keys or
the keys themselves

I want it to depend on the keys themselves. Assuming no duplicate
keys, sort -R will produce the same output no matter what order the
input.

d) something else...?

> > A better algorithm would be as follows: combine each key you want to
> > randomize with some salt (your "random seed"), and compute its hash.
> 
> That would be better, but it's hard for me to say whether it
> accurately reflected what I wanted.  I would want "sort --random" to
> sort according to a random permutation of the correct sort order, with
> ties retained.  First, it sounds like you don't want to retain ties.
> (I'm curious: why not?) 

Why don't you think I want to retain ties? If two keys are equal, and
you take their hash, they are still equal. Maybe I don't understand
what you mean by ties.

> Second, it's not at all clear to me that the algorithm you describe
> will actually result in a random permutation taken from a uniform
> distribution of all permutations.

It follows immediately from the fact that the output order in
independent of the input order (since we do a full sort) and the
contents of the keys (since we hash them). I.e. we can permute the
input keys and we still have the same chance of getting a given
permutation - restating, if X is a random permutation chosen by this
algorithm on an input then P(X=M)=P(XA=M)\forall permutations A,M.
>From this and the bijectivity of permutations it followss that
P(X=M)=P(XA=M)\forall A=(substituting N^{-1}M for
A)P(XN^{-1}M=M)=P(X=N) for any two permutations M,N, i.e. the
distribution is uniform.

> > I think a feature like this would be quite useful to all manner of
> > users, from those involved in statistical investigations to those
> > who want to randomize their mp3's without installing a separate
> > piece of software.
> 
> For the latter, almost anything will do; but what sort of statistical
> investigations did you have in mind?  That might affect what we mean
> by "random sort".  It's quite possible that there should be multiple
> flavors of "random sort".

Quite? I think there is really only one flavor of any general use.
Perhaps you can suggest others. As for the nature of the
investigations, well, anything for which one needs a random
permutation, I suppose. Also, random sampling with sort -R | head,
though somewhat inefficient, but convenient, and becoming more
efficient when one needs a largish fraction (say >0.5%) of a large
file. But neither of these needs anything other than a uniform
distribution on permutations.

So, if there is interest in implementing what I've described, let's
try to spec out all the requirements. In particular, will the number
canonicalization thing above be sufficient, is it important what the
hash function is, etc. E.g. would something like md5 be too slow?
Would something much simpler be insufficiently random?

Regards,

Frederik


_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils

Reply via email to