Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-12-02 Thread Terry Reedy

On 12/1/2016 3:19 AM, Sjoerd Job Postmus wrote:

On Wed, Nov 30, 2016 at 02:32:54PM -0600, Nick Timkovich wrote:

a generator with known length that's not indexable (a rare beast?).


I don't believe a generator is ever indexable.


Not as rare as you might think:


k = set(range(10))
len(k)

10

k[3]

Traceback (most recent call last):
  File "", line 1, in 
  TypeError: 'set' object does not support indexing


It is also not a generator.  (It is an iterable.).  If an *arbitrary* 
choice (without replacement) from a set is sufficient, set.pop() works. 
Otherwise, make a list.  If we wanted selection selection from sets to 
be easy, without making a list, we should add a method that accesses the 
internal indexable array.


--
Terry Jan Reedy


___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-12-01 Thread Sjoerd Job Postmus
On Wed, Nov 30, 2016 at 02:32:54PM -0600, Nick Timkovich wrote:
> a generator with known length that's not indexable (a rare beast?).

Not as rare as you might think:

>>> k = set(range(10))
>>> len(k)
10
>>> k[3]
Traceback (most recent call last):
  File "", line 1, in 
  TypeError: 'set' object does not support indexing
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Steven D'Aprano
On Wed, Nov 30, 2016 at 11:57:46AM -0800, Chris Kaynor wrote:

> Consider that this code will not produce the "correct" results (for a
> reasonable definition of correct):
> 
> a = (i for i in range(100)) # Pretend this does something more
> interesting, and isn't a trivial generator - maybe a file object
> reading by line.
> randomEntries = [random.choice(a) for i in range(10)] # May not be
> quite as obvious, such as the choices could be chosen in a more
> complex loop.
> 
> randomEntries may not contain 10 items (or the generation may error
> due to having insufficent items, depending on implementation),

Indeed. Given a iterator with 100 items, there's a 9% chance that the 
first choice will be in the last nine items, which implies that failures 
here will be common.


> it will
> not contain duplicates, and will be sorted.

Right. In other words, its a highly biased, non-random "random sample".

It may be worth adding a "reservoir sample" function, but I think it is 
a mistake to try modifying the existing functions to deal with 
iterators. Its too hard to get the same characteristics when sampling 
from a sequence and an iterator. Better to keep them as separate 
functions so that the caller knows what they're getting.

Here's my first attempt at implementing Algorithm R from Wikipedia:

https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R


from random import randrange, shuffle
import itertools

def reservoir_sample(iterable, count=1):
"""Return a list of count items sampled without replacement
from iterable, using reservoir sampling "Algorithm R".

This will exhaust the iterable, which must be finite.
"""
it = iter(iterable)
reservoir = list(itertools.islice(it, count))
if len(reservoir) < count:
raise ValueError('iterator is too short to sample %d items' % count)
shuffle(reservoir)
for i, value in enumerate(it, count+1):
j = randrange(0, i)
if j < count:
reservoir[j] = value
return reservoir



-- 
Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Bernardo Sulzbach

On 2016-11-30 19:11, Chris Kaynor wrote:


All that said, I would not be opposed to Python including a
random.reservoir_choice (probably not the best name) function *in
addition* to random.choice. The algorithm has its uses, but enough
drawbacks and gotchas that it likely is not a good candidate for a
fallback.


I think this may be the path of least resistance for this. Even if this 
does imply one or two new functions in random, it may be better than 
changing random.choice.


If these functions would be used enough by enough end users to justify 
this change is debatable, however.


--
Bernardo Sulzbach
http://www.mafagafogigante.org/
mafagafogiga...@gmail.com
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Chris Kaynor
On Wed, Nov 30, 2016 at 12:21 PM, Bernardo Sulzbach
 wrote:
> On 2016-11-30 17:57, Chris Kaynor wrote:
>>
>> On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor 
>> wrote:
>>>
>>> There are also issues with how it should behave on iterables that
>>> cannot be re-iterated (eg, random.choice will consume the iterator,
>>> and could only be called once safely).
>>
>>
>> I meant to include a sample in my previous e-mail:
>>
>> Consider that this code will not produce the "correct" results (for a
>> reasonable definition of correct):
>>
>> a = (i for i in range(100)) # Pretend this does something more
>> interesting, and isn't a trivial generator - maybe a file object
>> reading by line.
>> randomEntries = [random.choice(a) for i in range(10)]
>
>
> In such a case you should explicitly use a sample.
>
> I see your example as the caller's fault, which ignored the fact that the
> iterator would change after calls to choice.
>
> Hold the first 10 (in this case). For every subsequent element, randomly
> choose to replace one of the "held" ones by it (with a diminishing
> probability).
>
> Assume this does not offer the same performance as loading everything into
> memory. But it isn't meant to do so, as if you need / can / want, you could
> just shove it all into a list and use what we currently have.

It would be the caller's fault: they failed to follow the documentation.

That said, it would be a fairly subtle difference, and in many cases
may go unnoticed in simple testing. There is a good probability that
if you needed 10 samples out of 1000 entries, my code would appear to
work correctly (I don't care to do the math to figure out the actual
chance). Only on deeper testing or analysis of the results, would you
notice that the probabilities are messed up. Most likely, my exact
example would be fairly obvious, but imagine if the code were more
complex, with the random.choice call inside a loop, or even another
function.

If random.choice were updated to use a reservoir sampling as a
fallback, it also means that it is more likely for performance to
degrade with some types. Giving a list (and maybe other sequences)
would be O(1), but if a generator gets passed in, it falls back to
O(n), and with lots of random.random calls that are generally
naturally slow anyways.

All that said, I would not be opposed to Python including a
random.reservoir_choice (probably not the best name) function *in
addition* to random.choice. The algorithm has its uses, but enough
drawbacks and gotchas that it likely is not a good candidate for a
fallback.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Nick Timkovich
Is the goal to allow them to consume a finite generator of *unknown* length
(requires reservoir sampling
https://en.wikipedia.org/wiki/Reservoir_sampling  with N random calls,
which seemed to be the rub before?) or just consume a generator with known
length that's not indexable (a rare beast?). Consuming iterables if they
have a length like below wouldn't be so bad, but might be too niche.

class X:
def __init__(self, ele): self.ele = ele
def __len__(self): return len(self.ele)
def __iter__(self): return iter(self.ele)

x = X([1, 2, 3, 4, 5])
random.choice(x) # TypeError: 'X' object does not support indexing

Would allowing an optional 'len' argument alongside the iterator to
sample/choice be too narrow to be useful?

On Wed, Nov 30, 2016 at 2:21 PM, Bernardo Sulzbach <
mafagafogiga...@gmail.com> wrote:

> On 2016-11-30 17:57, Chris Kaynor wrote:
>
>> On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor 
>> wrote:
>>
>>> There are also issues with how it should behave on iterables that
>>> cannot be re-iterated (eg, random.choice will consume the iterator,
>>> and could only be called once safely).
>>>
>>
>> I meant to include a sample in my previous e-mail:
>>
>> Consider that this code will not produce the "correct" results (for a
>> reasonable definition of correct):
>>
>> a = (i for i in range(100)) # Pretend this does something more
>> interesting, and isn't a trivial generator - maybe a file object
>> reading by line.
>> randomEntries = [random.choice(a) for i in range(10)]
>>
>
> In such a case you should explicitly use a sample.
>
> I see your example as the caller's fault, which ignored the fact that the
> iterator would change after calls to choice.
>
> Hold the first 10 (in this case). For every subsequent element, randomly
> choose to replace one of the "held" ones by it (with a diminishing
> probability).
>
> Assume this does not offer the same performance as loading everything into
> memory. But it isn't meant to do so, as if you need / can / want, you could
> just shove it all into a list and use what we currently have.
>
> --
> Bernardo Sulzbach
> http://www.mafagafogigante.org/
> mafagafogiga...@gmail.com
> ___
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Bernardo Sulzbach

On 2016-11-30 17:57, Chris Kaynor wrote:

On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor  wrote:

There are also issues with how it should behave on iterables that
cannot be re-iterated (eg, random.choice will consume the iterator,
and could only be called once safely).


I meant to include a sample in my previous e-mail:

Consider that this code will not produce the "correct" results (for a
reasonable definition of correct):

a = (i for i in range(100)) # Pretend this does something more
interesting, and isn't a trivial generator - maybe a file object
reading by line.
randomEntries = [random.choice(a) for i in range(10)]


In such a case you should explicitly use a sample.

I see your example as the caller's fault, which ignored the fact that 
the iterator would change after calls to choice.


Hold the first 10 (in this case). For every subsequent element, randomly 
choose to replace one of the "held" ones by it (with a diminishing 
probability).


Assume this does not offer the same performance as loading everything 
into memory. But it isn't meant to do so, as if you need / can / want, 
you could just shove it all into a list and use what we currently have.


--
Bernardo Sulzbach
http://www.mafagafogigante.org/
mafagafogiga...@gmail.com
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Chris Kaynor
On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor  wrote:
> There are also issues with how it should behave on iterables that
> cannot be re-iterated (eg, random.choice will consume the iterator,
> and could only be called once safely).

I meant to include a sample in my previous e-mail:

Consider that this code will not produce the "correct" results (for a
reasonable definition of correct):

a = (i for i in range(100)) # Pretend this does something more
interesting, and isn't a trivial generator - maybe a file object
reading by line.
randomEntries = [random.choice(a) for i in range(10)] # May not be
quite as obvious, such as the choices could be chosen in a more
complex loop.

randomEntries may not contain 10 items (or the generation may error
due to having insufficent items, depending on implementation), it will
not contain duplicates, and will be sorted. This occurs because the
first call to random.choice will consume elements from a until it
picks one. The next call then consumes from there until it picks one,
and so forth.

Chris
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Chris Kaynor
This was also brought up back in April:
https://mail.python.org/pipermail//python-ideas/2016-April/039707.html

It got a few replies from Guido
(https://mail.python.org/pipermail//python-ideas/2016-April/039713.html
for one of them).

It seems the idea got dropped due to problems with making it properly
random (practically, the sampling has issues in cases of very large
sequences, let along infinite) and performance (either large numbers
of calls to random, or copying the sequence, each of which has its own
problems).

There are also issues with how it should behave on iterables that
cannot be re-iterated (eg, random.choice will consume the iterator,
and could only be called once safely).
Chris


On Wed, Nov 30, 2016 at 11:42 AM, Bernardo Sulzbach
 wrote:
> On 2016-11-30 17:25, Random832 wrote:
>>
>> Currently these functions fail if the supplied object has no len().
>> There are algorithms for this task that can work on any finite iterator
>> (for example, files as a stream of lines), and the functions could fall
>> back to these if there is no len().
>
>
> I like the idea, as long as it does not add too much overhead to currently
> existing code.
>
> It could be a special code path for reservoir sampling (I assume) for both
> functions (the first taking only one sample from the stream).
>
>
> --
> Bernardo Sulzbach
> http://www.mafagafogigante.org/
> mafagafogiga...@gmail.com
>
> ___
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Bernardo Sulzbach

On 2016-11-30 17:25, Random832 wrote:

Currently these functions fail if the supplied object has no len().
There are algorithms for this task that can work on any finite iterator
(for example, files as a stream of lines), and the functions could fall
back to these if there is no len().


I like the idea, as long as it does not add too much overhead to 
currently existing code.


It could be a special code path for reservoir sampling (I assume) for 
both functions (the first taking only one sample from the stream).



--
Bernardo Sulzbach
http://www.mafagafogigante.org/
mafagafogiga...@gmail.com
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Allow random.choice, random.sample to work on iterators

2016-11-30 Thread Random832
Currently these functions fail if the supplied object has no len().
There are algorithms for this task that can work on any finite iterator
(for example, files as a stream of lines), and the functions could fall
back to these if there is no len().
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/