Oscar Benjamin <[email protected]> added the comment:
To be clear, I suggest that this could be a separate function from the
existing sample() rather than a replacement or a routine used internally.
The intended use-cases for the separate function are:
1. Selecting from something where you really do not want to build a list and
just want/need a single pass. For example, when selecting a random line from
a text file, the entire file must be read anyway just to know how many lines
there are; the method here lets you make the selection in a single pass using
O(k) memory (see the sketch after this list). The same applies to many
situations involving generators etc.
2. Convenient, and possibly faster, selection from a most-likely-small
dict/set (this was the original context from python-ideas).
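To illustrate use-case 1, here is a minimal single-pass sketch for picking
one random line from a file. It uses the classic replace-with-probability-1/n
reservoir trick rather than the skip-based algorithm under discussion, and
random_line is a hypothetical name rather than part of the proposal:

from random import randrange

def random_line(path):
    # At the n-th line, keep it with probability 1/n; once the whole
    # file has been read, each line is equally likely to be the one kept.
    choice = None
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if randrange(n) == 0:
                choice = line
    return choice

Only O(1) memory is needed here because k = 1; the proposed function would
generalise this to arbitrary k while also avoiding one random draw per line.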
The algorithm naturally gives something in between the original order and a
randomised order. There are two possibilities for changing that:
a. Call random.shuffle or equivalent, either on the initial k items (so they
start in a random order) or at the end before returning.
b. Preserve the original ordering from the iterable: append all new items and
use a sentinel for removed items, stripping the sentinels at the end (see the
sketch below).
Since randomised order does not come naturally and requires explicit
shuffling, my preference would be to preserve the original order (b), because
there is no other way to recover the information lost by shuffling (assuming
that only a single pass is possible). The user can call shuffle if they want.
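As a minimal sketch of option (b), here is an order-preserving variant built
on the classic one-uniform-test-per-element reservoir scheme (not the
skip-based version); sample_iter_ordered is a hypothetical name:

from random import random, randrange

def sample_iter_ordered(iterable, k=1):
    SENTINEL = object()
    reservoir = []  # grows; evicted slots are overwritten with SENTINEL
    live = []       # indices into reservoir of the k surviving items
    for n, val in enumerate(iterable):
        if n < k:
            live.append(len(reservoir))
            reservoir.append(val)
        elif k / (n + 1) > random():
            # Evict a uniformly chosen survivor but *append* the
            # newcomer, so survivors keep their original relative order.
            reservoir[live.pop(randrange(k))] = SENTINEL
            live.append(len(reservoir))
            reservoir.append(val)
    return [v for v in reservoir if v is not SENTINEL]

The sentinel slots make the buffer grow with the number of swaps (an expected
O(k*log(n/k)) of them), so a real implementation would presumably compact the
buffer whenever sentinels outnumber live items to keep the O(k) bound.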
To explain what "geometrically distributed" means, I need to refer to the
precursor algorithm from which this is derived. A simple Python version could
be:
from random import random

def sample_iter_old(iterable, k=1):
    uvals_vals = []
    # Associate a uniform (0, 1) variate with each element;
    # iter(random, None) is an endless stream of uniform variates.
    for uval, val in zip(iter(random, None), iterable):
        uvals_vals.append((uval, val))
        # Keep the k items with the smallest uvals seen so far:
        uvals_vals.sort()
        uvals_vals = uvals_vals[:k]
    return [val for uval, val in uvals_vals]
In sample_iter_old each element val of the iterable is associated with a
uniform (0, 1) variate uval, and at each step we keep the k elements having
the smallest uvals. This is relatively inefficient because we need to
generate a uniform variate for every element of the iterable even though most
of the time the new val is simply discarded. sample_iter instead calculates
how many items to discard.
The quantity W in sample_iter is the max of the uvals from sample_iter_old:

W := max(uval for uval, val in uvals_vals)

A new item from the iterable will be swapped in if its uval is less than W.
Since each uval is uniform on (0, 1), the probability of that is exactly W,
so the number of items skipped before finding a uval < W is geometrically
distributed with parameter W.
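That single observation is what lets sample_iter replace a run of per-element
draws with one draw. A minimal sketch of generating the skip count by
inversion (skip_count is my name for it, not taken from sample_iter):

from math import floor, log
from random import random

def skip_count(W):
    # P(skip >= j) = (1 - W)**j, so inverting one uniform variate
    # yields a geometric variate with success probability W.
    # (random() returning exactly 0.0 is vanishingly rare and is
    # ignored here.)
    return floor(log(random()) / log(1 - W))

The iterator can then be advanced past skip_count(W) items in one go (e.g.
with itertools.islice) instead of drawing one uniform variate per element.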
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue41311>
_______________________________________