On Wed, Nov 30, 2016 at 12:21 PM, Bernardo Sulzbach
<mafagafogiga...@gmail.com> wrote:
> On 2016-11-30 17:57, Chris Kaynor wrote:
>>
>> On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor <ckay...@zindagigames.com>
>> wrote:
>>>
>>> There are also issues with how it should behave on iterables that
>>> cannot be re-iterated (eg, random.choice will consume the iterator,
>>> and could only be called once safely).
>>
>>
>> I meant to include a sample in my previous e-mail:
>>
>> Consider that this code will not produce the "correct" results (for a
>> reasonable definition of correct):
>>
>> a = (i for i in range(100)) # Pretend this does something more
>> interesting, and isn't a trivial generator - maybe a file object
>> reading by line.
>> randomEntries = [random.choice(a) for i in range(10)]
>
>
> In such a case you should explicitly use a sample.
>
> I see your example as the caller's fault, which ignored the fact that the
> iterator would change after calls to choice.
>
> Hold the first 10 (in this case). For every subsequent element, randomly
> choose to replace one of the "held" ones by it (with a diminishing
> probability).
>
> Assume this does not offer the same performance as loading everything into
> memory. But it isn't meant to do so, as if you need / can / want, you could
> just shove it all into a list and use what we currently have.

It would be the caller's fault: they failed to follow the documentation.

That said, it would be a fairly subtle difference, and in many cases
may go unnoticed in simple testing. There is a good probability that
if you needed 10 samples out of 1000 entries, my code would appear to
work correctly (I don't care to do the math to figure out the actual
chance). Only on deeper testing or analysis of the results, would you
notice that the probabilities are messed up. Most likely, my exact
example would be fairly obvious, but imagine if the code were more
complex, with the random.choice call inside a loop, or even another
function.

If random.choice were updated to use a reservoir sampling as a
fallback, it also means that it is more likely for performance to
degrade with some types. Giving a list (and maybe other sequences)
would be O(1), but if a generator gets passed in, it falls back to
O(n), and with lots of random.random calls that are generally
naturally slow anyways.

All that said, I would not be opposed to Python including a
random.reservoir_choice (probably not the best name) function *in
addition* to random.choice. The algorithm has its uses, but enough
drawbacks and gotchas that it likely is not a good candidate for a
fallback.
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to