Hi,
Two suggestions for "shuf":
1. Offer the option of obtaining entropy from the RDSEED instruction where
this instruction is available. Indeed, it might be best to make RDSEED
the default entropy source whenever possible, or to combine RDSEED with
the current default.
2. Include in the man page a reference to the following article, which
clarifies requirements for equiprobable selection of permutations and
points out pitfalls that often afflict random shufflers:
https://dl.acm.org/doi/pdf/10.1145/3664645
- - - - - - - - - -
End of suggestions. Below are a few additional notes on context and
background.
I took an interest in combinatorial selection a few years back, and I was
surprised to find bad advice and biased algorithms in seemingly
authoritative textbooks and other seemingly trustworthy sources. At least
four distinct kinds of bias (i.e., deviation from equiprobable selection)
appear frequently in random shufflers:
A. Modulo bias ("random_number % M" where M does not divide the range
from which random_number is drawn)
B. Shuffle bias (using a biased algorithm instead of
Fisher-Yates/Durstenfeld)
C. Seed bias (a fixed-length B-bit pseudo-random number generator (PRNG)
seed precludes equiprobable selection in cases where 2^B is smaller than
N!, where N is the number of items to be shuffled; some permutations are
"unreachable")
D. Occupancy bias (even if a B-bit seed contains maximal entropy (i.e., is
chosen equiprobably from the range [0..2^B-1]), the *ensenble* of
PRNG-plus-unbiased-permutation-algorithm does *not* choose a permutation
equiprobably; a statistical-physics-style "occupancy problem" analysis of
the ensemble shows this)
I looked at the "shuf" source code and it appears that shuf avoids biases
A and B: randint_genmax() in lib/randint.c appears to use rejection to do
its job correctly and avoid the modulo bias of naive "R%M", and the random
permutation code in randperm.c appears to use Fisher-Yates/Durstenfeld.
Bravo!
Users can avoid seed bias and occupancy bias (C and D) by feeding
true-random entropy, from RDSEED or some other source, using the
"--random-source=FILE" feature. My first suggestion is to make this
easier for users by offering RDSEED as an explicit option or making it the
default wherever it is available.
I'm worried about relying on /dev/random for entropy. Over the years the
guarantees provided by /dev/random have changed, and its use is
particularly worrisome inside virtual machines. Considerable attention
has been paid to RDSEED, and since it's an unprivileged CPU instruction,
it will deliver high-quality entropy even within a virtual machine.
Perhaps the most prudent approach would be to combine RDSEED with the
current default entropy source; if done carefully, the result would be
entropy at *least* as random as the *most* random entropy source combined.
Overall I'm pleasantly surprised at the quality and competence of the
"shuf" utility; it is superior to many, many shufflers, including many
from seemingly clueful sources.
Thank you for writing and maintaining shuf and other GNU core utilities!
Terence Kelly
tpkelly @ { acm.org, cs.princeton.edu, eecs.umich.edu }
https://dl.acm.org/profile/81100523747