[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

Oscar Benjamin Fri, 17 Jul 2020 17:22:24 -0700


Oscar Benjamin <oscar.j.benja...@gmail.com> added the comment:


> At its heart, this a CPython optimization to take advantage of list() being 
> slower than a handful of islice() calls.

This comment suggest that you have missed the general motivation for reservoir 
sampling. Of course the stdlib can not satisfy all use cases so this can be out 
of scope as a feature. It is not the case though that this is a CPython 
optimisation.

The idea of reservoir sampling is that you want to sample from an iterator, you 
only get one chance to iterate over it, and you don't know a priori how many 
items it will yield. The classic example of that situation is reading from a 
text file but in general it maps neatly onto Python's concept of iterators. The 
motivation for generators/iterators in Python is that there are many situations 
where it is better to avoid building a concrete in-memory data structure and it 
can be possible to avoid doing so with appropriately modified algorithms (such 
as this one). 

The core use case for this feature is not sampling from an in-memory data 
structure but rather sampling from an expensive generator or an iterator over a 
file/database. The premise is that it is undesirable or perhaps impossible to 
build a list out of the items of the iterable. In those contexts the 
comparative properties of sample/choices are to some extent irrelevant because 
those APIs can not be used or should be avoided because of their overhead in 
terms of memory or other resources.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

Reply via email to