Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Thanks for the suggestion.  I'll give it some thought over the next few days.

Here are a few initial thoughts:

* The use of islice() really helps going through a small population quickly.

* The current sample() uses _randbelow() instead of random() to guarantee 
equidistribution and to eliminate the small biases that come from using 
random().  So this is a step backwards (see the sketch below).
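
A minimal sketch of the rejection-sampling idea behind _randbelow() 
(CPython's actual implementation differs in detail, and the name 
randbelow_sketch here is just for illustration):

    from random import getrandbits

    def randbelow_sketch(n):
        # Draw just enough bits to cover range(n), then reject and
        # redraw anything >= n, so every result in range(n) is exactly
        # equally likely.  By contrast, floor(random() * n) maps 2**53
        # equally likely floats onto n buckets, which cannot be
        # perfectly even unless n divides 2**53.
        k = n.bit_length()
        r = getrandbits(k)
        while r >= n:
            r = getrandbits(k)
        return r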

* The current sample() guarantees that slices of the sample are all in 
randomized order.  The proposed algorithm doesn't:

    # Fully shuffled
    >>> sample(range(20), k=19)
    [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]

    # Not random at all
    >>> sample_iter(range(20), k=19)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]

* The current sample() runs in time proportional to *k* rather than *n*.  This 
means that the proposed algorithm, which must consume the entire iterator, 
slows down considerably for large populations:

    In [12]: %timeit sample(range(100_000_000), k=20)
    15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [13]: %timeit sample_iter(range(100_000_000), k=20)
    1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* For an apples-to-apples comparison with choices(), use the Python 3.9 
version, which was updated to use floor(); for this purpose, floor() is 
several times faster than int() (see the timing sketch below).
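
A quick way to see the gap (a micro-benchmark sketch; exact numbers 
vary by build and machine):

    from timeit import timeit

    # Both expressions truncate a positive float; floor() takes a
    # faster path than the more general int() constructor.
    setup = 'from math import floor; from random import random'
    t_int = timeit('int(random() * 20)', setup=setup)
    t_floor = timeit('floor(random() * 20)', setup=setup)
    print(t_int, t_floor)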

* Instead of listing the islice object, consider using next().  That would 
likely be faster (a small illustration of the idiom follows the code):

    from itertools import islice
    from math import exp, floor, log
    from random import random, randrange

    def sample_iter(iterable, k=1):
        iterator = iter(iterable)
        values = list(islice(iterator, k))    # fill the initial reservoir
        W = exp(log(random()) / k)
        try:
            while True:
                # skip is geometrically distributed
                skip = floor(log(random()) / log(1 - W))
                # next() raises StopIteration once the input is exhausted
                values[randrange(k)] = next(islice(iterator, skip, skip + 1))
                W *= exp(log(random()) / k)
        except StopIteration:
            return values
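
The next(islice(iterator, skip, skip+1)) idiom advances the iterator 
past exactly *skip* elements and returns the one after them, without 
building an intermediate list.  A small illustration (the names here 
are just for demonstration):

    from itertools import islice

    it = iter(range(10))
    print(next(islice(it, 3, 4)))   # skips 0, 1, 2 and prints 3
    print(next(it))                 # the underlying iterator resumes at 4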

* Using mock's call-count feature shows that the proposed algorithm uses more 
entropy than the current sample() code.  It seems that random() and 
_randbelow() are both being called for every selection.  I haven't yet 
investigated what causes this, but it is unfavorable, especially for slow 
generators like SystemRandom().  A sketch of the call-counting setup is below.
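
A minimal sketch of the call-counting setup, assuming sample_iter() 
lives in a hypothetical module named samplers and calls the 
module-global random():

    from unittest import mock
    import random
    import samplers   # hypothetical module containing sample_iter()

    # wraps= delegates to the real random() while recording each call.
    with mock.patch('samplers.random', wraps=random.random) as counted:
        samplers.sample_iter(range(10_000), k=20)
    print(counted.call_count)   # total number of random() draws consumed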

----------
assignee:  -> rhettinger

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________