If we get this function (which I would like), the version with k items (default 1) is much better. Some iterators cannot be repeated at all, so not only is it slower to call multiple times if you need k>1, it's impossible.
On Mon, Jul 13, 2020, 8:37 AM David Mertz <me...@gnosis.cx> wrote: > This is an inefficient reservoir sampling. The optimized version does not > need to call a random inclusion switch on every element, but can skip a > geometrically ordered collection of (random) skip lengths to avoid most > random inclusion decisions. > > Obviously, all items must be iterated over no matter what, but if > randrange() is significant compared to the cost of next(), the > skip-then-decide version is much preferred, especially as size grows. > > On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benja...@gmail.com> > wrote: > >> I posted this in the thread about indexing dict items but it seems to >> have been buried there so I'm starting a new thread. >> >> Maybe there could be a function in the random module for making a >> random choice from an arbitrary (finite) iterable. This implementation >> can choose a random element from an iterable by fully iterating over >> it so is O(N) in terms of CPU operations but O(1) for memory usage: >> >> # rnd.py >> from random import randrange >> >> def choice_iter(iterable): >> iterator = iter(iterable) >> try: >> value = next(iterator) >> except StopIteration: >> raise ValueError("Empty iterable") >> for n, candidate in enumerate(iterator, 2): >> if not randrange(n): >> value = candidate >> return value >> >> You could use that to get a choice from a dict, set etc: >> >> >>> choice_iter({'q', 'w', 'e'}) >> 'e' >> >>> choice_iter({'q', 'w', 'e'}) >> 'q' >> >>> choice_iter({'q', 'w', 'e'}) >> 'e' >> >>> choice_iter({'q', 'w', 'e'}) >> 'w' >> >> You can make a random choice from the keys, values or items of a dict. >> For a large dict/set it will be slow compared to converting to a list >> because it's calling randrange once per item. It can probably be made >> a lot faster by doing something like calling randbits and extracting >> the bits for multiple iterations of the loop. >> >> Although choice_iter is slower than choice it works for arbitrary >> iterables and has constant memory requirements when used with a lazy >> iterator. There are situations in which you would not want to build a >> list and use random.choice and where the computational overhead is >> insignificant. For example choice_iter can be used to randomly choose >> a line from an arbitrarily large text file without loading it entirely >> into memory: >> >> >>> from rnd import choice_iter >> >>> with open('rnd.py') as fin: >> ... line = choice_iter(fin) >> ... >> >>> line >> ' except StopIteration:\n' >> >> When reading from a large text file this is roughly optimal in >> performance terms because it does precisely the minimum amount of IO >> (one pass) while having constant memory overhead. Of course if you >> wanted to select multiple lines from the file then calling choice_iter >> repeatedly is immediately suboptimal because it would read the file >> more than once. It's not hard though to generalise the function above >> to something that e.g. makes a selection of k values from an iterable >> in a single pass. Here is a version that works without replacement: >> >> def sample_iter(population, k): >> iterator = iter(population) >> values = [] >> for _ in range(k): >> try: >> value = next(iterator) >> except StopIteration: >> raise ValueError("Too few items") >> values.append(value) >> for n, candidate in enumerate(iterator, k+1): >> random_index = randrange(n) >> if random_index < k: >> values[random_index] = candidate >> return values # maybe shuffle first? >> >> The memory cost of this is O(k) regardless of the size of the input >> iterable. >> >> -- >> Oscar >> _______________________________________________ >> Python-ideas mailing list -- python-ideas@python.org >> To unsubscribe send an email to python-ideas-le...@python.org >> https://mail.python.org/mailman3/lists/python-ideas.python.org/ >> Message archived at >> https://mail.python.org/archives/list/python-ideas@python.org/message/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/ >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FLXVRXRPZCIAFYD4UIXV4Y3LSFBZZKTQ/ Code of Conduct: http://python.org/psf/codeofconduct/