If we get this function (which I would like), the version with k items
(default 1) is much better. Some iterators cannot be repeated at all, so
not only is it slower to call multiple times if you need k>1, it's
impossible.

On Mon, Jul 13, 2020, 8:37 AM David Mertz <me...@gnosis.cx> wrote:

> This is an inefficient reservoir sampling. The optimized version does not
> need to call a random inclusion switch on every element, but can skip a
> geometrically ordered collection of (random) skip lengths to avoid most
> random inclusion decisions.
>
> Obviously, all items must be iterated over no matter what, but if
> randrange() is significant compared to the cost of next(), the
> skip-then-decide version is much preferred, especially as size grows.
>
> On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benja...@gmail.com>
> wrote:
>
>> I posted this in the thread about indexing dict items but it seems to
>> have been buried there so I'm starting a new thread.
>>
>> Maybe there could be a function in the random module for making a
>> random choice from an arbitrary (finite) iterable. This implementation
>> can choose a random element from an iterable by fully iterating over
>> it so is O(N) in terms of CPU operations but O(1) for memory usage:
>>
>> # rnd.py
>> from random import randrange
>>
>> def choice_iter(iterable):
>>     iterator = iter(iterable)
>>     try:
>>         value = next(iterator)
>>     except StopIteration:
>>         raise ValueError("Empty iterable")
>>     for n, candidate in enumerate(iterator, 2):
>>         if not randrange(n):
>>             value = candidate
>>     return value
>>
>> You could use that to get a choice from a dict, set etc:
>>
>> >>> choice_iter({'q', 'w', 'e'})
>> 'e'
>> >>> choice_iter({'q', 'w', 'e'})
>> 'q'
>> >>> choice_iter({'q', 'w', 'e'})
>> 'e'
>> >>> choice_iter({'q', 'w', 'e'})
>> 'w'
>>
>> You can make a random choice from the keys, values or items of a dict.
>> For a large dict/set it will be slow compared to converting to a list
>> because it's calling randrange once per item. It can probably be made
>> a lot faster by doing something like calling randbits and extracting
>> the bits for multiple iterations of the loop.
>>
>> Although choice_iter is slower than choice it works for arbitrary
>> iterables and has constant memory requirements when used with a lazy
>> iterator. There are situations in which you would not want to build a
>> list and use random.choice and where the computational overhead is
>> insignificant. For example choice_iter can be used to randomly choose
>> a line from an arbitrarily large text file without loading it entirely
>> into memory:
>>
>> >>> from rnd import choice_iter
>> >>> with open('rnd.py') as fin:
>> ...     line = choice_iter(fin)
>> ...
>> >>> line
>> '    except StopIteration:\n'
>>
>> When reading from a large text file this is roughly optimal in
>> performance terms because it does precisely the minimum amount of IO
>> (one pass) while having constant memory overhead. Of course if you
>> wanted to select multiple lines from the file then calling choice_iter
>> repeatedly is immediately suboptimal because it would read the file
>> more than once. It's not hard though to generalise the function above
>> to something that e.g. makes a selection of k values from an iterable
>> in a single pass. Here is a version that works without replacement:
>>
>> def sample_iter(population, k):
>>     iterator = iter(population)
>>     values = []
>>     for _ in range(k):
>>         try:
>>             value = next(iterator)
>>         except StopIteration:
>>             raise ValueError("Too few items")
>>         values.append(value)
>>     for n, candidate in enumerate(iterator, k+1):
>>         random_index = randrange(n)
>>         if random_index < k:
>>             values[random_index] = candidate
>>     return values  # maybe shuffle first?
>>
>> The memory cost of this is O(k) regardless of the size of the input
>> iterable.
>>
>> --
>> Oscar
>> _______________________________________________
>> Python-ideas mailing list -- python-ideas@python.org
>> To unsubscribe send an email to python-ideas-le...@python.org
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FLXVRXRPZCIAFYD4UIXV4Y3LSFBZZKTQ/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to