On 12/28/2013 05:13 PM, Jyothis V wrote: > Hi, thanks for the reply. I understand why something like reservoir sampling > is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir > sampling, an array of length head_length is obtained. At this stage, the > array is not completely shuffled because the first head_length elements were > copied verbatim. 2) This array is then randomly permuted while writing. My > question is whether these two steps could be clubbed together, just as shown > in the second algorithm given in the wikipedia page you mentioned.
Yes there are more improvements possible in this area. Also we should avoid the more CPU intensive reservoir sampling in the common case of reading small inputs from stdin: http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272 Also reservoir sampling is currently not used with --repeat. But it does seem possible to use reservoir-sampling with replacement: https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf thanks, Pádraig.