[issue35094] Improved algorithms for random.sample

Raymond Hettinger Sun, 28 Oct 2018 13:59:41 -0700


Raymond Hettinger <[email protected]> added the comment:


For the common case, where k is small and n is large, reservoir sampling makes 
n calls rather the current k plus a few reselections -- the only cost is a 
temporary k-sized set using superfast int hashing.  This technique works even 
for a very large values of n, such as sample(range(1_000_000_000), k=10_000).

For the case where k is a large percentage of n, the size of the set becomes 
noticeable and reselections would become more common (placing a higher load on 
the underlying RNG and eating its entropy, keep in mind that the underlying RNG 
can be SystemRandom for example).  So, we switch algorithms to a partial 
shuffle, which has no reselections and uses a temporary list of references to 
the sampled objects.   The places the absolute minimum burden on the underlying 
RNG and makes the minimum number of data swaps (neither of those virtues can be 
claimed by reservoir sampling).

The only advantage of the reservoir approach is not requiring auxiliary
storage.  Keep in mind, the temporary list only holds references to the sampled 
data, so tends to be small relative to that data.  It is no more 
disadvantageous than typical applications of list comprehensions which make new 
lists while looping over some other iterable.

In any case, the algorithms prefer to optimize for fewest calls the the random 
number generator rather than aspiring to zero auxiliary storage.  The 
randbelow() calls are not superfast, so we don't want to use many of them.  
Likewise, calls to SystemRandom() eat available entropy,  so we don't want to 
use many of them either.  Also, the Random class is designed to be subclassed 
to allow uses of other RNGs which are likely to have a smaller period that the 
MersenneTwister so we don't want many calls to them either (it eats their 
limited state space just like shuffle() exceeds the capabilities of MT with an 
input size as small as 2081).

Lastly, I have a vaguely held concern that reservoir sampling uses the RNG in a 
way that would magnify any weaknesses in that RNG (by virtues to making more 
calls and by virtue of using selections in the full range from n-k to n), so we 
would get lower quality shuffles.

FWIW, all of these things were considered when shuffle() was designed, it 
wasn't like other methods weren't considered.  The design we have now was 
deemed to be the best fit for most of our users, most of time (we've never 
gotten a complaint about the temporary storage being a problem for any user, 
ever).  I would however expect complaints about an increased number of calls to 
the user's rngs.

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue35094>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue35094] Improved algorithms for random.sample

Reply via email to