Raymond Hettinger <raymond.hettin...@gmail.com> added the comment: Thanks for the suggestion. I'll give it some thought over the next few days.
Here are a few initial thoughts: * The use of islice() really helps going through a small population quickly. * The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random(). So this is a step backwards. * The current sample() guarantees that slices of the sample are all in randomized order. The proposed algorithm doesn't: # Fully shuffled >>> sample(range(20), k=19) >>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1] # Not random at all >>> sample_iter(range(20), k=19) >>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18] * The current sample runs in speed proportional to *k* rather than *n*. This means that the proposed algorithm slows down considerably for large populations: In [12]: %timeit sample(range(100_000_000), k=20) 15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) In [13]: %timeit sample_iter(range(100_000_000), k=20) 1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) * For an apples-to-apples comparison with choices(), use the version in Python 3.9 which was updated to use floor() which is several times faster than int(). * Instead of listing the islice object, consider using next() instead. That would likely be faster: def sample_iter(iterable, k=1): iterator = iter(iterable) values = list(islice(iterator, k)) W = exp(log(random())/k) try: while True: # skip is geometrically distributed skip = floor( log(random())/log(1-W) ) values[randrange(k)] = next(islice(iterator, skip, skip+1)) W *= exp(log(random())/k) except StopIteration: return values * Using mock's call-count feature shows that the proposed algorithm uses more entropy that the current sample() code. It seems that random() and _randbelow() are both being called for every selection. I haven't yet investigated to what causes this, but it is unfavorable especially for slow generators like SystemRandom(). ---------- assignee: -> rhettinger _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue41311> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com