Re: Shuf reservoir sampling

Pádraig Brady Sun, 29 Dec 2013 05:10:31 -0800

On 12/28/2013 05:13 PM, Jyothis V wrote:
> Hi, thanks for the reply. I understand why something like reservoir sampling 
> is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir 
> sampling, an array of length head_length is obtained. At this stage, the 
> array is not completely shuffled because the first head_length elements were 
> copied verbatim. 2) This array is then randomly permuted while writing. My 
> question is whether these two steps could be clubbed together, just as shown 
> in the second algorithm given in the wikipedia page you mentioned.


Yes there are more improvements possible in this area.

Also we should avoid the more CPU intensive reservoir sampling
in the common case of reading small inputs from stdin:
http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272

Also reservoir sampling is currently not used with --repeat.
But it does seem possible to use reservoir-sampling with replacement:
https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

thanks,
Pádraig.

Re: Shuf reservoir sampling

Reply via email to