Oscar Benjamin <[email protected]> added the comment:
To be clear, I suggest that this could be a separate function from the
existing sample() rather than a replacement or a routine used internally.
The intended use-cases for the separate function are:
1. Selecting from something where you really do not want to build a list and
just want/need a single pass. For example, when selecting a random line from
a text file, the entire file must be read anyway just to know how many lines
there are; the method here lets you make the selection in a single pass using
O(k) memory (see the sketch after this list). The same applies to many
situations involving generators etc.
2. Convenient, and possibly faster, selection from a most-likely-small
dict/set (this was the original context from python-ideas).
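To illustrate use-case 1, here is a minimal single-pass sketch for picking
one random line from a file. It uses the classic replace-with-probability-1/n
reservoir trick rather than the skip-based algorithm under discussion, and
random_line is a hypothetical name rather than part of the proposal:

from random import randrange

def random_line(path):
    # At the n-th line, keep it with probability 1/n; once the whole
    # file has been read, each line is equally likely to be the one kept.
    choice = None
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if randrange(n) == 0:
                choice = line
    return choice

Only O(1) memory is needed here because k = 1; the proposed function would
generalise this to arbitrary k while also avoiding one random draw per line.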
The algorithm naturally gives something in between the original order and a
randomised order. There are two possibilities for changing that:
a. Call random.shuffle or equivalent, either on the initial k items (so they
start in a random order) or at the end before returning.
b. Preserve the original ordering from the iterable: append all new items and
use a sentinel for removed items, stripping the sentinels at the end (see the
sketch below).
Since randomised order does not come naturally and requires explicit
shuffling, my preference would be to preserve the original order (b), because
there is no other way to recover the information lost by shuffling (assuming
that only a single pass is possible). The user can call shuffle if they want.
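As a minimal sketch of option (b), here is an order-preserving variant built
on the classic one-uniform-test-per-element reservoir scheme (not the
skip-based version); sample_iter_ordered is a hypothetical name:

from random import random, randrange

def sample_iter_ordered(iterable, k=1):
    SENTINEL = object()
    reservoir = []  # grows; evicted slots are overwritten with SENTINEL
    live = []       # indices into reservoir of the k surviving items
    for n, val in enumerate(iterable):
        if n < k:
            live.append(len(reservoir))
            reservoir.append(val)
        elif k / (n + 1) > random():
            # Evict a uniformly chosen survivor but *append* the
            # newcomer, so survivors keep their original relative order.
            reservoir[live.pop(randrange(k))] = SENTINEL
            live.append(len(reservoir))
            reservoir.append(val)
    return [v for v in reservoir if v is not SENTINEL]

The sentinel slots make the buffer grow with the number of swaps (an expected
O(k*log(n/k)) of them), so a real implementation would presumably compact the
buffer whenever sentinels outnumber live items to keep the O(k) bound.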
To explain what "geometrically distributed" means, I need to refer to the
precursor algorithm from which this is derived. A simple Python version could
be:
from random import random

def sample_iter_old(iterable, k=1):
    uvals_vals = []
    # Associate a uniform (0, 1) variate with each element;
    # iter(random, None) is an endless stream of uniform variates.
    for uval, val in zip(iter(random, None), iterable):
        uvals_vals.append((uval, val))
        # Keep the k items with the smallest uvals seen so far:
        uvals_vals.sort()
        uvals_vals = uvals_vals[:k]
    return [val for uval, val in uvals_vals]
In sample_iter_old each element val of the iterable is associated with a
uniform (0, 1) variate uval, and at each step we keep the k elements having
the smallest uvals. This is relatively inefficient because we need to
generate a uniform variate for every element of the iterable even though most
of the time the new val is simply discarded. sample_iter instead calculates
how many items to discard.
The quantity W in sample_iter is the max of the uvals from sample_iter_old:

W := max(uval for uval, val in uvals_vals)

A new item from the iterable will be swapped in if its uval is less than W.
Since each uval is uniform on (0, 1), the probability of that is exactly W,
so the number of items skipped before finding a uval < W is geometrically
distributed with parameter W.
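That single observation is what lets sample_iter replace a run of per-element
draws with one draw. A minimal sketch of generating the skip count by
inversion (skip_count is my name for it, not taken from sample_iter):

from math import floor, log
from random import random

def skip_count(W):
    # P(skip >= j) = (1 - W)**j, so inverting one uniform variate
    # yields a geometric variate with success probability W.
    # (random() returning exactly 0.0 is vanishingly rare and is
    # ignored here.)
    return floor(log(random()) / log(1 - W))

The iterator can then be advanced past skip_count(W) items in one go (e.g.
with itertools.islice) instead of drawing one uniform variate per element.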
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue41311>
_______________________________________