[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-26 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

>  However, let me argue my case one more time, and if you still disagree, feel 
> free to close this.

Thank you for expressing your ideas so clearly.  However, I'm going to go with 
Tim's recommendation and leave the docs as is.

At this point, you already have a complete understanding of what the tool does 
and how to use it, so there isn't anything at stake other than a feeling a 
vindication for a bug that is already resolved.  Looking to the needs of other 
users, I think we're best off not even bringing up non-guaranteed 
implementation details and instead follow the example of Numpy and R.  Thanks 
again for the suggestion, but we'll respectfully decline this time around.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-26 Thread Scott Eilerman

Scott Eilerman  added the comment:

Raymond, Tim, thanks for your replies so far. I understand (and for the most 
part, agree with) your points about not being able to list every behavior, and 
not wanting to cause uncertainty in users. However, let me argue my case one 
more time, and if you still disagree, feel free to close this.

1. It is expected (in fact, one might argue it's the entire point) that 
initializing random.seed() with a fixed value will produce a repeatable set of 
results for a traditional random number generator. An user expects that calling 
the following should always produce the same sequence of numbers:

random.seed(22)
random.random()
random.random()
random.random()

2. Based on that behavior for one of the most typical/traditional functions in 
the random module, a naive user (me) might assume that random.sample() is 
drawing from its population in a similar manner (i.e. that sequence of returned 
items, regardless of how many you ask the function to return, is uniquely 
determined by the seed). While this is certainly an assumption...

2a. This assumption is somewhat validated by the introductory section of the 
random module docs, which states "Almost all module functions depend on the 
basic function random()..."

2b. More importantly, an user can "validate" this assumption by doing some 
simple tests, e.g.:

choices = range(100)
random.seed(22)
random.sample(choices,1)
random.seed(22)
random.sample(choices,2)
random.seed(22)
random.sample(choices,3)
... and so on

Because of the nature of the set/list optimization, it is VERY possible that an 
user could do due diligence in testing like this (a few different seeds, a few 
different sets of "choices", testing up to k=10) and never uncover the 
problematic behavior. You'd pretty much have to set up some loops like I did 
earlier in this thread, which I don't think many users would do unless the 
expect to find a problem. Even then, with certain selections of "choices", you 
might still get the "expected" results.

2c. If you suspected a problem, or really wanted to be sure the function does 
what you assume it will do, obviously you can open up random.py and take a 
look. However, I doubt many users do this for every built-in module and 
function they use; clearly the point of documentation is to avoid this scenario.

3. As Raymond mentioned, this does not appear to be a "common" problem, and 
perhaps that is enough to not add anything to the docs. However, due to the 
somewhat elusive nature of the behavior, it could certainly go undetected in 
many cases, potentially causing problems without anyone noticing. Perhaps I 
chose a very unorthodox implementation to get the results I desired; I easily 
could have used random.shuffle() or random.sample(pop, len(pop)) and picked the 
nth element. However, one could imagine cases in which you have a very large 
population and you want to optimize by using sample() to get the nth random 
draw rather than randomizing the entire list, so I don't think it's an entirely 
unjustified approach.

4. Given the above points, I'd argue that a one-line insertion into the docs 
would help users steer clear of a hard-to-anticipate, potentially costly 
pitfall. My suggested language is a more direct identification of the possible 
consequences, though I agree that it it perhaps too worry-inducing without 
specifying the "cause" of the problem. Raymond's algorithmic note may be a 
better choice and would have been enough of an indicator for me to avoid the 
mistake I made.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-25 Thread Tim Peters

Tim Peters  added the comment:

There's nothing in the docs I can see that implies `sample(x, n)` is a prefix 
of what `sample(x, n+1)` would have returned had the latter been called 
instead.  If so, then - as always - it's "at your own risk" when you rely on 
behaviors that haven't been specified.  Since the universe of _possible_ 
behaviors that haven't been specified is immense, I'd rather we didn't even 
start to list all of them ;-)

Seriously, when the docs don't promise something, it's usually a mistake to 
assume they just forgot to mention it.  I'd leave the docs alone in this case.

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-23 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

> Something along the lines of: "For a fixed seed, random.sample(population, k)
> is not guaranteed to return the same samples for different values of k."

In a way, the proposed wording succinctly directly addresses the problem you 
had.  So, it would seem like a reasonable suggestion.  On the other hand, it 
would be easy for others who haven't had this problem to have a hard time 
figuring out what it means (when should they be worried, what should be 
avoided, why is it a concern at all, what to do about it).

In general, the docs are worded in an affirmative manner (here's what something 
does, here's what it is for, and here is how to use it correctly).  In this 
case, the docs already indicate the intended way to address this use case: "the 
resulting list is in selection order so that all sub-slices will also be valid 
random samples.  This allows raffle winners (the sample) to be partitioned into 
grand prize and second place winners (the subslices)."

Perhaps there could be an algorithmic note, "internally, sample() shifts 
selection algorithms depending on the proportion of the population being 
sampled".  However, this would be unusual -- we don't usually document 
implementation details.  Numpy[1] and R[2] make no mention of the internals.  
Julia[3] does discuss the algorithms but primarily from an efficiency 
point-of-view rather than as a usage note.

Perhaps it may be best to leave this alone rather than adding a note that may 
itself create confusion and worry.  AFAICT, this hasn't come up before in the 
15 year history of random.sample(), not even a StackOverflow question.

[1] 
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.choice.html
[2] https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/sample
[3] http://juliastats.github.io/StatsBase.jl/stable/sampling.html#Sampling-API-1

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-21 Thread Scott Eilerman

Scott Eilerman  added the comment:

Sorry, there's a typo in that example. It should look like:


random.seed(fixed_seed)
random.sample(choices, n)[-1]

Then, later, I want the next draw, so I did:

random.seed(fixed_seed)
random.sample(choices, n+1)[-1]

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-21 Thread Scott Eilerman

Scott Eilerman  added the comment:

To clarify the use case where this behavior was problematic for me, I wanted to 
get the nth random draw from a given distribution, so I used something like:

random.seed(fixed_seed)
random.sample(choices, n)[-1]

Then, later, I want the next draw, so I did:

random.seed(fixed_seed)
random.sample(choices, n)[-1]


The workaround would be to use random.shuffle and pick the nth item, or use 
random.sample once with the highest k value I expect to need and then pick the 
nth element from that list. But I settled on the above implementation because 
for the first few cases I tested, it returned the results I expected, and 
nothing in the docs suggested I should expect otherwise.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-21 Thread Scott Eilerman

Scott Eilerman  added the comment:

Something along the lines of: "For a fixed seed, random.sample(population, k) 
is not guaranteed to return the same samples for different values of k."

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-21 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

What additional wording do you propose?

--
assignee: docs@python -> rhettinger
nosy: +rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33114] random.sample() behavior is unexpected/unclear from docs

2018-03-21 Thread Scott Eilerman

New submission from Scott Eilerman :

I ran into a "bug" when using random.sample() in which I got some results I 
didn't expect. After digging a little more, this is either a side effect of  
the optimization that's made when k > 5, or I am using the function in a way 
that wasn't intended. If that's the case, I would recommend calling out this 
behavior in the documentation.

The crux of the issue is that, for a given seed, random.sample(choices,k) gives 
the same sequence of results for k=1 to k=5, but that sequence can be different 
(for the same seed) at k=6 and higher. From my initial testing this seems to 
only occur when 'choices' has an even length.

Example code to reproduce this issue:

import random
seed = 199
choices = range(-10,12)

for k in range(10):
random.seed(seed)
print(random.sample(choices,k))


Example code to look at many different occurrences of this issue:

import random
choices = range(-10,12)
count = 0
for seed in range(200):
for k in range(8):
random.seed(seed)
seq1 = random.sample(choices, k)
random.seed(seed)
seq2 = random.sample(choices, k+1)
if seq1 != seq2[:-1]:
print(seed)
print(seq1)
print(seq2)
count += 1
print(f'Number of bugged results: {count}/200')


To illustrate the odd/even issue, changing choices to range(-10,11) results in 
zero bugged results.

--
assignee: docs@python
components: Documentation, Library (Lib)
messages: 314201
nosy: Scott Eilerman, docs@python
priority: normal
severity: normal
status: open
title: random.sample() behavior is unexpected/unclear from docs
type: behavior
versions: Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com