[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2021-07-02 Thread Oscar Benjamin
Oscar Benjamin added the comment: I was contacted by someone interested in this, so I've posted the last version above as a GitHub gist under the MIT license: https://gist.github.com/oscarbenjamin/4c1b977181f34414a425f68589e895d1

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-19 Thread Oscar Benjamin
Oscar Benjamin added the comment: Yeah, I guess it's a YAGNI. Thanks Raymond and Tim for looking at this!

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-19 Thread Raymond Hettinger
Raymond Hettinger added the comment:
> I agree that this could be out of scope for the random module
> but I wanted to make sure the reasons were considered.
I think we've done that. Let's go ahead and close this one down. In general, better luck can be had by starting with a common real

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-18 Thread Tim Peters
Tim Peters added the comment: The lack of exactness (and possibility of platform-dependent results, including, e.g., when a single platform changes its math libraries) certainly works against it. But I think Raymond is more bothered that there's apparently no _compelling_ use case, in

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-18 Thread Oscar Benjamin
Oscar Benjamin added the comment:
> Please don't get personal.
Sorry, that didn't come across with the intended tone :) I agree that this could be out of scope for the random module but I wanted to make sure the reasons were considered. Reading between the lines I get the impression that

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Raymond Hettinger
Raymond Hettinger added the comment:
> This comment suggests that you have missed the general
> motivation for reservoir sampling.
Please don't get personal. I've devoted a good deal of time thinking about your proposal. Tim is also giving it an honest look. Please devote some time to

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Oscar Benjamin
Oscar Benjamin added the comment:
> At its heart, this is a CPython optimization to take advantage of list() being
> slower than a handful of islice() calls.
This comment suggests that you have missed the general motivation for reservoir sampling. Of course the stdlib cannot satisfy all use

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Raymond Hettinger
Raymond Hettinger added the comment: More thoughts:
* If sample_iter() were added, people would expect a choices_iter() as well.
* Part of the reason that Set support was being dropped from sample() is that it was rarely used and that it was surprising that it was an O(n) operation instead

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Raymond Hettinger
Raymond Hettinger added the comment: I've put more thought into the proposal and am going to recommend against it. At its heart, this is a CPython optimization to take advantage of list() being slower than a handful of islice() calls. It also gains a speed benefit by dropping the antibias

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Raymond Hettinger
Change by Raymond Hettinger: versions: +Python 3.10

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Tim Peters
Tim Peters added the comment: Julia's randsubseq() doesn't let you specify the _size_ of the output desired. It picks each input element independently with probability p, and the output can be of any size from 0 through the input's size (with mean output length p*length(A)). Reservoir
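The difference between the two kinds of selection can be sketched in Python. This is only an illustration of the behaviour described above, not Julia's implementation; `bernoulli_subsequence` and its parameters are hypothetical names.

```python
import random

def bernoulli_subsequence(iterable, p, rng=random):
    """Keep each item independently with probability p (input order preserved)."""
    return [item for item in iterable if rng.random() < p]

population = range(1000)

# Independent selection: the output length is itself random,
# with mean p * len(population) = 10 here.
sizes = [len(bernoulli_subsequence(population, 0.01)) for _ in range(5)]
print(sizes)

# Fixed-size selection: the caller chooses the output length.
fixed = random.sample(population, 10)
print(len(fixed))  # always exactly 10
```

With independent selection the caller controls only the expected size; a fixed-size sample needs a different algorithm.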

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Raymond Hettinger
Raymond Hettinger added the comment: Other implementations aren't directly comparable, but I thought I would check to see what others were doing:
* Scikit-learn uses reservoir sampling but only when k / n > 0.99. Also, it requires a follow-on step to shuffle the selections.
* numpy does

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-17 Thread Oscar Benjamin
Oscar Benjamin added the comment: All good points :) Here's an implementation with those changes that shuffles but gives the option to preserve order. It also handles the case W=1.0, which can happen at the first step with probability 1 - (1 - 2**-53)**k. Attempting to preserve order
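For a sense of scale, the quoted probability can be evaluated exactly. The sketch below reads the base as 1 - 2**-53, which is an assumption on my part: 2**-53 is the spacing of the floats that random() returns, and a probability has to lie between 0 and 1. The function name is hypothetical.

```python
from fractions import Fraction

def prob_w_equals_one(k):
    """Evaluate 1 - (1 - 2**-53)**k exactly with rational arithmetic."""
    eps = Fraction(1, 2**53)  # assumed reading of the exponent, see above
    return 1 - (1 - eps) ** k

for k in (1, 10, 1000):
    print(k, float(prob_w_equals_one(k)))
```

For small k the value is very close to k * 2**-53, so the case is rare but not impossible, which is why an implementation has to decide what to do with it.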

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-16 Thread Tim Peters
Tim Peters added the comment: Thanks! That explanation really helps explain where "geometric distribution" comes from. Although why it keeps taking k'th roots remains a mystery to me ;-) Speaking of which, the two instances of exp(log(random())/k) are numerically suspect. Better written as

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-16 Thread Oscar Benjamin
Oscar Benjamin added the comment: To be clear, I suggest that this could be a separate function from the existing sample rather than a replacement or a routine used internally. The intended use-cases for the separate function are:
1. Select from something where you really do not want to

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-16 Thread Tim Peters
Tim Peters added the comment: Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 lines "at random" out of a multi-million-line text file, this kind of reservoir sampling allows doing that while holding no more than one line in memory simultaneously. Materializing an
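A minimal sketch of that use case, using the simplest form of reservoir sampling (often called Algorithm R) rather than the implementation discussed in the issue. `reservoir_sample` is a hypothetical name, and an in-memory `io.StringIO` stands in for a large file.

```python
import io
import random

def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item number index + 1 is kept with probability k / (index + 1),
            # replacing a uniformly chosen slot.
            j = rng.randrange(index + 1)
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    return reservoir

# File objects iterate line by line, so only the reservoir and the
# current line are held at any time.
text = io.StringIO("".join(f"line {i}\n" for i in range(100_000)))
print(reservoir_sample(text, 3))
```

This version calls the random number generator once per input item, and the positions within the result are not in random order; shuffle the result if order matters.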

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-15 Thread Raymond Hettinger
Raymond Hettinger added the comment: Thanks for the suggestion. I'll give it some thought over the next few days. Here are a few initial thoughts:
* The use of islice() really helps going through a small population quickly.
* The current sample() uses _randbelow() instead of random() to
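To picture why islice() matters here, a skip-based reservoir sampler (commonly known as Algorithm L) can be sketched as follows. This was written for illustration and is not the code posted in the issue; `sample_iter_sketch`, `unit`, and the handling of W == 1.0 as a zero-length skip are my own assumptions.

```python
import math
import random
from itertools import islice

def sample_iter_sketch(iterable, k, rng=random):
    """Pick k items from an iterable, jumping over runs of unselected items."""
    it = iter(iterable)
    reservoir = list(islice(it, k))
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    if k == 0:
        return reservoir

    def unit():
        # Uniform float in (0.0, 1.0], so math.log() never sees 0.0.
        return 1.0 - rng.random()

    w = math.exp(math.log(unit()) / k)
    while True:
        if w < 1.0:
            # Geometrically distributed number of items to pass over.
            skip = math.floor(math.log(unit()) / math.log1p(-w))
        else:
            # Assumed handling of the rare W == 1.0 case: skip nothing.
            skip = 0
        # islice() consumes the skipped items without any per-item
        # random number generation.
        chosen = list(islice(it, skip, skip + 1))
        if not chosen:
            return reservoir
        reservoir[rng.randrange(k)] = chosen[0]
        w *= math.exp(math.log(unit()) / k)

print(sample_iter_sketch(range(1_000_000), 5))
```

The number of random draws grows roughly with k * log(n / k) rather than with n, so most of the time is spent inside islice() consuming the input.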

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-15 Thread Karthikeyan Singaravelan
Change by Karthikeyan Singaravelan: nosy: +rhettinger

[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

2020-07-15 Thread Oscar Benjamin
New submission from Oscar Benjamin: The random.choice/random.sample functions will only accept a sequence to select from. Can there be a function in the random module for selecting from an arbitrary iterable? It is possible to make an efficient function that can make random selections from
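The limitation is easy to reproduce. In the small check below, `numbers()` is a hypothetical generator standing in for any one-shot iterable; the last lines show the usual workaround of materializing it first.

```python
import random

def numbers():
    """Hypothetical one-shot iterable standing in for any stream of items."""
    for i in range(1000):
        yield i

# Neither function accepts a bare generator: both need a sequence.
for func, args in ((random.choice, ()), (random.sample, (3,))):
    try:
        func(numbers(), *args)
    except TypeError as exc:
        print(f"{func.__name__}: TypeError: {exc}")

# The usual workaround builds a list of the whole iterable first,
# which costs O(n) memory even when only a few items are wanted.
picked = random.sample(list(numbers()), 3)
print(picked)
```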