[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-06 Thread Tim Peters


Tim Peters  added the comment:

I've been involved - usually very lightly, but sometimes deeply - with PRNGs 
for over 40 years.  I've never seen anyone raise the kinds of concerns you're 
raising here, and, frankly, they don't make sense to me.

But on the chance that I've missed recent fundamental developments, I did some 
searching.  The state of the art for uniform selection from a range appears to 
be what the Go language recently adopted, shown as Algorithm 5 in this paper 
(which also surveys what other languages do):

https://arxiv.org/pdf/1805.10941.pdf

Nobody gives a rip about "correlations" when reusing an internal state.  What 
they're aiming at is a combination of "no bias" and "peak speed".

Python's job is harder, because we couldn't care less what the native machine 
word size is.  For example, randrange(10**500) works fine.  Approaches based on 
chopping back multiplication with a "random" C double can't get more than 53 
bits (on most machines), the limit of IEEE-754 format double precision.  Python 
dropped that approach not only because it was slightly biased, but also because 
it didn't scale to unbounded ranges.

The Go approach works like this Python code, where NBITS would be hard coded in 
C to 32 or 64, depending on the machine word size.  Its advantage is that it 
_usually_ needs no arbitrary divisions (implied by "%") at all, just masking 
and shifting.  At worst, it _may_ need one for-real division.  But it's built 
to work with native machine integer precision, and requires full double-width 
integer multiplication:

from random import randrange
NBITS = 64 # need NBITS x NBITS full-precision multiply
POWER = 1 << NBITS
MASK = POWER - 1

def rr(s):  # uniform random int in range(s)
assert 0 < s <= POWER
x = randrange(POWER) # i.e., a random bitstring with NBITS bits
m = x * s # full double-width product
r = m & MASK
if r < s:
t = (POWER - s) % s  # the expensive line
while r < t:
x = randrange(POWER)
m = x * s
r = m & MASK
return m >> NBITS

In a nutshell, in each [i * POWER, (i+1) * POWER) clopen interval, it rejects 
the first POWER % s values, leaving exactly floor(POWER/s) multiples of s in 
the interval.

Of course that suffers "correlations" too if you reuse the seed - although not 
of the exact microscopic kind you're talking about.

You haven't responded to questions about why that specific test is thought to 
be especially important, or to why you believe you can't do an obvious thing 
instead (use different seeds, or don't reset the seed at all).

As to who's making raw assertions here, the situation isn't symmetric ;-)  If 
you want a change, you have to _make a case_ for it.  And a good one.  Every 
core Python developer who has knowledge of this code has now told you they 
personally don't see any case for it.  So, sorry, but the status quo is always 
favored in the absence of compelling reason to change.  There were compelling 
reasons to switch to the current scheme.  As you said yourself:

"""
And I understand the puzzlement about my test file setting the same random seed 
and then complaining about correlations. 
"""

That was insightful.  The difference is we never got over that puzzlement ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread jfbu


jfbu  added the comment:

"bug" is a strong word, which I did never employ myself, apart from using this 
channel of report. I rather think of a (non-documented) "deficiency", but I 
expect the consensus will again be that I am expressing a "raw expression". 
However reading more than once that "the correlations are indeed unsurprising" 
it is my turn to see there a raw expression. The correlations are unsurprising 
*only* if one looks at the source code and understand how a (to a very high 
degree) uniform distribution on a power of 2 range is reduced to distribution 
on the smaller range (keeping the extremely high uniformity). Thus, sorry, the 
correlations are to the contrary *very surprising* to the end user who has no 
knowledge of the internals.

Donald Knuth for example many decades ago in his work on MetaPost used a RNG 
which is a kind a primitive ancestor (in the family of those commented upon in 
AOCP) of the much more sophisticated one used nowadays by Python. He used the 
rescaling with rounding method to go from power of 2 range to non power of 2 
range. That method induces some non-uniformity and if I understand (without 
having checked) it was the one from Python < 3.2. At some point Python changed 
its way to another way, to cure non-uniformity (and other artefacts of the 
simple minded rescaling method). But there are other ways to reduce the 
non-uniformity to negligible levels which are not the choice made currently in 
Python code. (which for people not having _randbelow_with_getrandbits before 
their eyes is simply to draw a random integer with at most the number of bits 
of the limit N until one is found at most equal to N).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I concur with Tim and Mark.  This isn't a bug.  The randomization occurs at the 
getrandbits() level.  The downstream functions, such as randbelow and 
randrange, have no responsibility to create additional dispersion; instead, 
their role is to map the upstream scattering to the desired target 
distributions.

The correlations found by the OP are indeed unsurprising.  They disappear 
entirely when using two difference seeds or by making successive selections 
from a single stream.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread Tim Peters


Tim Peters  added the comment:

This is where you're not getting traction:

"A randrange() function should a priori not be so strongly tied to the binary 
base."

That's a raw assertion.  _Why_ shouldn't it be?  "Because I keep saying so" 
isn't changing minds ;-)

I understand you're looking at exact equality of t-tuples.  I wasn't in my 
example:  I was looking at the individual values, one pair at a time.  The 
extreme correlation is dead obvious by eyeball either way, despite that the 
only test you seem to have in mind (exact equality of t-tuples) is blind to it. 
 Why is that test so important?  Why does it not matter that, e.g., number of 
inversions, number of runs, distribution of run-lengths (etc) remain highly 
correlated regardless?

Nobody else has had a problem with this, and it remains unclear why you do:  
what's your objection to Mark's suggestions (use different seeds, or _don't_ 
reset the seed)?  That's the obvious approach:  use the facilities in 
straightforward ways.

In any case, we can't/won't make changes on a whim.  As far as possible, we 
strive to keep results bit-for-bit identical across releases for people who 
save/set seeds, hoping to get reproducible results.  Changing the results from 
any random module function requires strong justification.

So far, I don't see that here.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread jfbu


jfbu  added the comment:

@tim.peters yes, a uniform random variable rescaled to two nearby scales N and 
M will display strong correlations. The CPython randrange() exhibits however 
orders of magnitude higher such correlations, but only in relation to a common 
bitlength. A randrange() function should a priori not be so strongly tied to 
the binary base.

The example you show would not be counted as a hit by my test for the 
randomseed 12.

>>> s = 0
>>> for t in range(10):
... random.seed(t)
... x = [round(random.random() * 100) for i in range(10)]
... random.seed(t)
... y = [round(random.random() * 101) for i in range(10)]
... if x == y:
... s += 1
... 
>>> s
94
>>> s = 0
>>> for t in range(10):
... random.seed(t)
... x = [random.randrange(100) for i in range(10)]
... random.seed(t)
... y = [random.randrange(101) for i in range(10)]
... if x == y:
... s += 1
... 
>>> s
90432

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread Tim Peters


Tim Peters  added the comment:

Sorry, I don't see "a problem" here either.  Rounding instead can change the 
precise nature of the correlations if you insist on starting from the same 
seed, but it hardly seems a real improvement; e.g.,

>>> random.seed(12)
>>> [round(random.random() * 100) for i in range(10)]
[47, 66, 67, 14, 1, 37, 27, 81, 69, 60]
>>> random.seed(12)
>>> [round(random.random() * 101) for i in range(10)]
[48, 66, 67, 14, 1, 38, 28, 82, 70, 61]

That is, while there are fewer identical values, the correlation is 
nevertheless obvious and extreme.  Not only not a bug, it's not even surprising 
;-)

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread jfbu


jfbu  added the comment:

Yes indeed the source code of _randbelow_with_getrandbits generates this 
effect. And I understand the puzzlement about my test file setting the same 
random seed and then complaining about correlations. But there is some 
non-uniformity which is enormous between what happens for say n=99 and m=127 
versus n=99 and m=128.

Now that I looked at the actual source code, I have in other programming 
contexts available manners of reducing from a power of 2 to an arbitrary range 
which should not show this artefact.  I will need to translate it into Python 
and may submit a PR for evaluation after having tested.

My test file illustrates that if randrange() proceeded from a pseudo-random 
variable emulating the uniform distribution in (0,1) and rescaled and rounded 
to an integer range, correlations between distinct ranges would be of a 
completely different nature than what the CPython method leads to.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread Mark Dickinson


Mark Dickinson  added the comment:

These results are hardly surprising, given the algorithms used. But why would 
you consider this a problem?

If you want your randrange(n) and randrange(m) results to be independent, don't 
generate them from the exact same random source, starting with the same seed. 
Use different seeds, or generate one set of results followed by the other 
without resetting the seed in between.

--
nosy: +mark.dickinson, rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39867] randrange(N) for N's in same dyadic blocs have excessive correlations when sharing identical seeds

2020-03-05 Thread jfbu


New submission from jfbu :

We generate quadruples of random integers using randrange(n) and randrange(m) 
and count how many times the quadruples are identical, using the same random 
seed. Of course for nearby n and m (the real life example was with n==95 and 
m==97) we do expect matches. But we found orders of magnitude more than was 
expected.

The attached file demonstrates this by comparison with random()*n (with 
rounding) as alternative method to generate the random integers (we are aware 
this gives less uniformity for a given range, but these effects are completely 
negligible in comparison to the effect we test). For the latter the probability 
of matches is non-vanishing but orders of magnitude smaller than using 
randrange(n).

Here is an excerpt of our testing result. Each trial uses a random seed 
(selected via randrange(1)). Then 4 random integers in two given ranges 
are generated and compared. A hit is when all 4 match.

- with randrange():

n = 99, m = 124, 4135 hits among 1 trials
n = 99, m = 125, 3804 hits among 1 trials
n = 99, m = 126, 3803 hits among 1 trials
n = 99, m = 127, 3892 hits among 1 trials
n = 99, m = 128, 0 hits among 1 trials
n = 99, m = 129, 0 hits among 1 trials
n = 99, m = 130, 0 hits among 1 trials
n = 99, m = 131, 0 hits among 1 trials

- with random():

n = 99, m = 124, 0 hits among 1 trials
n = 99, m = 125, 0 hits among 1 trials
n = 99, m = 126, 0 hits among 1 trials
n = 99, m = 127, 0 hits among 1 trials
n = 99, m = 128, 0 hits among 1 trials
n = 99, m = 129, 0 hits among 1 trials
n = 99, m = 130, 0 hits among 1 trials
n = 99, m = 131, 0 hits among 1 trials

The test file has some hard-coded random seeds for reproductibility.

Although I did only limited testing it is flagrant there is completely abnormal 
correlation between randrange(n) and randrange(m) when the two integers have 
the same length in base 2.

Tested with 3.6 and 3.8.

--
files: testrandrange.py
messages: 363451
nosy: jfbu
priority: normal
severity: normal
status: open
title: randrange(N) for N's in same dyadic blocs have excessive correlations 
when sharing identical seeds
versions: Python 3.8
Added file: https://bugs.python.org/file48955/testrandrange.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com