Frederik Eaton <[EMAIL PROTECTED]> writes: > You don't get exactly the same behavior, e.g. the final distinction > between 3 and 3.0 disappears, but I think this is OK.
But this would cause the randomization option to not be orthogonal to the rest of the options. The resulting behavior would be hard to document, and it would be hard to handle bug reports when it didn't behave the way users expected. We should have a simple way to explain the proposed behavior, a way that works in general. > c) whether the sorting should depend on the order of the input keys or > the keys themselves > > I want it to depend on the keys themselves. No, this issue should be orthogonal to the rest of "sort". In the rest of "sort", input order matters when you specify the -s (--stable) option; otherwise it doesn't matter. Randomness shouldn't change this. > Why don't you think I want to retain ties? If two keys are equal, and > you take their hash, they are still equal. But you appended a different random number to each key, right? So they won't be equal after that. Perhaps I didn't understand your proposed algorithm. Did you intend to (say) append the same random number to each key, compute an MD5 checksum of the result, and then sort based on the checksum? > As for the nature of the investigations, well, anything for which > one needs a random permutation, I suppose. Also, random sampling > with sort -R | head, though somewhat inefficient, but convenient, But these uses should not attempt to sort ties together. They should attempt to sort them separately. Perhaps it would be better to think of things in a completely different way. Currently "sort" lets you specify keys that can be any contiguous set of input columns. Perhaps we should extend the syntax for keys, so that the option "-k random" means to use an imaginary extra column of completely random values. That would be easy to explain to people. It would also let people decide for themselves what to do with ties. One problem with implementing this idea efficiently is that no matter what size you use for the random values, it's wrong. If the size is too small, you may get collisions. If it's too large, "sort" will be too inefficient. Besides, you shouldn't have to store the random values at all -- they're just an internal concept. For example, when the sort is so large that you need temporary files, there should be no need to save the random numbers in those files, as you can always generate new ones later. But doing this correctly, so that the result is still a uniform distribution of permutations, will be a bit tricky. > would something like md5 be too slow? Would something much simpler > be insufficiently random? Why not let the user specify a source for randomness? The sufficiently paranoid can specify /dev/random. Those who want reproducible results can specify a simple pseudorandom generator. _______________________________________________ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils