So, the prototype runs a little slower than I had expected. It currently uses MD5 hashes; I could also look into CRC or something faster (but less secure, for those concerned). Anyway, here is a sample:

$ time ./sort -R < /usr/share/dict/words > /dev/null
./sort -R < /usr/share/dict/words > /dev/null  1.67s user 0.01s system 99% cpu 1.684 total
$ time ./sort < /usr/share/dict/words > /dev/null
./sort < /usr/share/dict/words > /dev/null  0.45s user 0.01s system 99% cpu 0.462 total
$ time rl < /usr/share/dict/words > /dev/null
rl < /usr/share/dict/words > /dev/null  0.07s user 0.01s system 99% cpu 0.074 total
$ wc -l /usr/share/dict/words
96274 /usr/share/dict/words

'rl' is the executable from Arthur's randomize-lines.

Of course, this 'sort'-based shuffle has the advantage of being independent of input order

$ print -l g f e d c b a | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452 -
$ print -l a b c d e f g | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452 -

and being field-aware

$ print -l "a 1" "a 2" "b 2" "b 03" "c 1" "c 2" | ./sort -k 1,1R -k 2,2n
c 1
c 2
a 1
a 2
b 2
b 03

- whatever these are worth.
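
In case it helps to see the idea in code, the comparison behind -R is roughly along the lines of the sketch below. It is only an illustration, not the actual patch: a 64-bit FNV-1a hash stands in for MD5, everything is slurped into memory, and no random key is mixed in - which is why the ordering depends only on the line contents, the same thing the identical md5sum digests above show for the prototype.

/* hash-shuffle sketch: sort lines by a hash of their contents.
   FNV-1a is a stand-in for MD5; no error checking, everything in memory. */
#define _GNU_SOURCE            /* for getline */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>         /* ssize_t */

static uint64_t
fnv1a64 (const char *s, size_t n)
{
  uint64_t h = 0xcbf29ce484222325ULL;
  for (size_t i = 0; i < n; i++)
    {
      h ^= (unsigned char) s[i];
      h *= 0x100000001b3ULL;
    }
  return h;
}

struct keyed_line { uint64_t key; char *text; };

static int
compare_keys (const void *a, const void *b)
{
  const struct keyed_line *x = a, *y = b;
  if (x->key != y->key)
    return x->key < y->key ? -1 : 1;
  return strcmp (x->text, y->text);     /* arbitrary tie-break */
}

int
main (void)
{
  struct keyed_line *lines = NULL;
  size_t used = 0, alloc = 0;
  char *buf = NULL;
  size_t bufsize = 0;
  ssize_t len;

  while ((len = getline (&buf, &bufsize, stdin)) > 0)
    {
      if (buf[len - 1] == '\n')
        buf[--len] = '\0';
      if (used == alloc)
        lines = realloc (lines, (alloc = alloc ? 2 * alloc : 64) * sizeof *lines);
      lines[used].key = fnv1a64 (buf, len);   /* key is a pure function of the line */
      lines[used].text = strdup (buf);
      used++;
    }

  qsort (lines, used, sizeof *lines, compare_keys);
  for (size_t i = 0; i < used; i++)
    puts (lines[i].text);
  return 0;
}

Note that with a content-only key, duplicate input lines always end up adjacent in the output of this sketch - one of the ways a hash-keyed sort differs from a true uniform shuffle.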

I just wanted to make sure the slowness is acceptable before proceeding.

Frederik

On Thu, Jun 02, 2005 at 02:16:10PM -0700, Frederik Eaton wrote:
> James> I think the consensus is that the functionality belongs in "sort". Beyond that things are a bit less clear. However, Paul put forward a proposed usage which adapts the current -k option (see http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html). Nobody made any comments suggesting that that wasn't a good way of doing things.
>
> I liked my proposal for co-opting "-s":
>
> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html
>
> In other words, rather than having a "random" column, you'd have a "row number" column, which would turn into a random column under the hash operation. The documentation for '-s' would be changed to say that it enables a comparison based on input row number.
>
> James> Last time this was discussed on this list there was some talk about selecting permutations with a seed. This makes me uncomfortable, though doubtless it would be convenient. The problem is that the number of permutations of the records of most files is larger than 2^64, and so representing the seed and making use of it would be very tricky. I then had a look at Fascicle 2 of Volume 4 of Knuth, but it concentrates on generating all permutations, and so wasn't directly much help.
>
> This seems to be a common point of confusion which I think needs to be cleared up. The mapping from random seed to sample space is rarely surjective - in fact, whenever you sample more than a few numbers from a random number generator, it isn't. It only matters that the number of possible seeds is much larger than the number of samplings we would ever choose to perform, and if not 2^64 then definitely 2^128 will be sufficient for all eternity.
>
> (Maybe some theoretical common ground is needed. A random variable is a function X from a measurable space with probability measure P to another measurable space, such that X^{-1}(B) is measurable for every measurable B in the range of X. We can define a probability measure on the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.
>
> Usually the domain of X is taken to be unspecified and uncountable, but since in practice we only ever sample a finite subset of it - and since we cannot calculate the inverse of X at arbitrary points - we can replace the domain with a suitably large finite set with a discrete probability measure and still get good sample coverage, independently of the size of the range of X. This new domain is the seed domain.)
>
> Phil> Paul's statement:
> Phil> > It's not as easy to implement as it sounds, I'm afraid; even if you
> Phil> > can assume /dev/random or /dev/urandom.
> Phil>
> Phil> is a slight concern, but I imagine the problems are applicable to *any* shuffling utility. Is it that the app must guarantee that all lines of a non-seekable stdin have an equal chance of any sort order?
>
> See my comment to James above. I think one need not make this guarantee, since only a tiny fraction of the possible sort orders could ever be tried by the user. However, I think it would be true for my proposal (and the one that James suggested to Davis) if one took the random number generator or hash function to be variable or unspecified during analysis of the algorithm. The fact that we ship with a particular RNG or hash function (of all the uncountably many possibilities, some unimplementable) hard-coded is not a problem, since its resources are never exhausted.
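
To make the row-number idea a little more concrete: the sort key for a line would be a keyed hash of the line's input position, so a given seed always reproduces the same permutation, and the finite set of possible seeds plays the role of the seed domain above. The following is only a rough sketch, not a proposed implementation - splitmix64 stands in for whatever keyed hash would actually be used, and it permutes bare row numbers 0..n-1 rather than lines:

/* row-number-hash sketch: print a permutation of 0..n-1 by sorting the
   row numbers on a keyed hash.  splitmix64 is a stand-in for a real
   keyed hash; same seed, same permutation. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t seed;

static uint64_t
mix64 (uint64_t x)                  /* splitmix64 finalizer */
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

static int
compare_hashed_rows (const void *a, const void *b)
{
  uint64_t ka = mix64 (seed ^ *(const uint64_t *) a);
  uint64_t kb = mix64 (seed ^ *(const uint64_t *) b);
  return ka < kb ? -1 : ka > kb;
}

int
main (int argc, char **argv)
{
  size_t n = argc > 1 ? strtoul (argv[1], NULL, 10) : 10;
  seed = argc > 2 ? strtoull (argv[2], NULL, 10) : 1;

  uint64_t *rows = malloc (n * sizeof *rows);
  for (size_t i = 0; i < n; i++)
    rows[i] = i;                    /* the "row number" column */

  /* Sorting the rows by their hashed values shuffles them; a row's
     final position depends only on (seed, input position), not on the
     line's contents. */
  qsort (rows, n, sizeof *rows, compare_hashed_rows);

  for (size_t i = 0; i < n; i++)
    printf ("%llu\n", (unsigned long long) rows[i]);
  return 0;
}

Run twice with the same seed it prints the same permutation; change the seed and you get a different one, which is about all "repeatably" needs to mean here, and the 2^64 possible seeds of this toy version are the finite domain being substituted for the uncountable one.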

> Jim> It looks like there are some desirable features that can be provided only by a shuffle-enabled program that is key-aware. Key specification and the comparison code are already part of sort. Obviously, duplicating all of that in a separate program is not an option. I don't relish the idea of factoring out sort's line- and key-handling code either, but it might be feasible.
> Jim>
> Jim> However, I do like the idea of a new program that simply outputs a random permutation of its input records, and that does it well, and repeatably. The Unix tool philosophy certainly does encourage the `perform one task and do it well' approach. Since doing it well includes handling input larger than available virtual memory, this is not trivial -- and it is well suited to the coreutils, i.e., it's not easily implementable as a script.
> Jim>
> Jim> Initially, I was inclined to say that adding both the new program (no key support) and related functionality to sort was desirable. Thinking of the limits of robustness of such a new program, I realized that if the input is sufficiently large and not seekable (e.g., from a pipe), then the program will have to resort to writing temporary files, much as sort already does. More duplicated effort: determining how much memory to use (like sort's --buffer-size=SIZE option), managing the temporary files, ensuring that they're removed upon interrupt, etc. But maybe not prohibitive. The new program would also have to have an option like sort's -z, --zero-terminated option, and --temporary-directory=DIR, and --output=FILE. In effect, it would need all of sort's options that don't relate to sorting.
>
> I did have some sympathy for Davis' point that a specialized 'shuffle' can be implemented much more efficiently than one based on an algorithm tuned to mergesort... Although I don't see a problem with two utilities if someone wants to write the second one, and it is correct, I would tend to agree with you and prefer a key-aware version as a priority.
>
> However, I just realized that the algorithm James suggested to Davis is wrong, for the same reason that the original random patch I responded to is wrong:
>
> http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html
>
> I'm sorry for not noticing this earlier.
>
> One possibility for an efficient random permutation of a large file is a program which scans the file once and records the location of each line in an index, then applies a uniformly distributed random permutation to this line index by stepping through it in order and swapping the current entry with an entry chosen at random from among the current and later entries, and finally steps through the index again, dereferencing each entry and printing the corresponding line from the original file.
>
> Note that there is a randomize-lines package in Debian by Arthur de Jong; I've cc'ed him. It's much faster than 'sort' on small files, but doesn't handle large files at all.
>
> Jim> So implementing a robust shuffle program, even one without key handling capabilities, would require much of the infrastructure already present in sort.c.
> Jim>
> Jim> It sure sounds like shuffle and sort should share a lot of code, one way or another, so why not have them share the line- and key-handling code, too? I won't rule out adding a new program, like shuffle, but I confess I'm less inclined now than when I started typing this message.
>
> Well, it would only be line-handling code, and whatever else, but no key-handling code, since Davis' shuffle wouldn't be key-aware - unless you're talking about my proposal. And maybe common things could be put in a library, I don't know. I am against duplication, but not against change.
>
> Frederik
> -- http://ofb.net/~frederik/
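
P.S. To make the one-pass index + swap idea above concrete, here is a rough sketch of that approach for a seekable input file. It ignores everything Jim raises about robustness (non-seekable input, temporary files, memory limits), does no error checking, and uses rand() with a modulo - biased, and only a placeholder for a real generator:

/* shuffle-by-index sketch: index line offsets, Fisher-Yates the index,
   then print lines by seeking.  Requires a seekable file; no error checks. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main (int argc, char **argv)
{
  FILE *f = fopen (argc > 1 ? argv[1] : "/usr/share/dict/words", "r");
  if (!f)
    return 1;

  /* Pass 1: record the byte offset at which each line starts.  */
  long *off = NULL;
  size_t n = 0, alloc = 0;
  long pos = 0;
  int c, at_bol = 1;
  while ((c = getc (f)) != EOF)
    {
      if (at_bol)
        {
          if (n == alloc)
            off = realloc (off, (alloc = alloc ? 2 * alloc : 1024) * sizeof *off);
          off[n++] = pos;
        }
      at_bol = (c == '\n');
      pos++;
    }

  /* Permute the index: swap entry i with one chosen uniformly from
     entries i..n-1 (the current entry included).  rand()%k is biased
     and rand() itself is weak; both are placeholders.  */
  srand ((unsigned) time (NULL));
  for (size_t i = 0; i + 1 < n; i++)
    {
      size_t j = i + (size_t) rand () % (n - i);
      long tmp = off[i];
      off[i] = off[j];
      off[j] = tmp;
    }

  /* Pass 2: dereference the shuffled index and print each line.  */
  for (size_t i = 0; i < n; i++)
    {
      fseek (f, off[i], SEEK_SET);
      while ((c = getc (f)) != EOF && c != '\n')
        putchar (c);
      putchar ('\n');
    }

  fclose (f);
  return 0;
}

For input on a pipe the second pass obviously can't seek, which is where the temporary-file machinery Jim describes would come in.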