[issue9025] Non-uniformity in randrange for large arguments.

2010-09-06 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

Put in a fix with r84576.  May come back to it to see if it can or should be 
optimized with C.  For now, this gets the job done.

--
resolution:  - fixed
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-08-07 Thread Raymond Hettinger

Changes by Raymond Hettinger rhettin...@users.sourceforge.net:


--
priority: normal - high

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Distribution with my algorithm:

...
from collections import Counter
print Counter(_randint(6755399441055744) % 3 for _ in xrange(1))

= Counter({0L: 33342985, 2L: 5781, 1L: 33321234})

Distribution: {0: 0.333429853, 2: 0.57808, 1: 0.33321234}

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-24 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

A couple of points:

(1) In addition to documenting the extent of the repeatability, it would be 
good to have tests to prevent changes that inadvertently change the sequence of 
randrange values.

(2) For large arguments, cross-platform reproducibility is already a bit 
fragile.  For example, the _randbelow function depends on the system _log 
function---see the line

k = int(1.1 + _log(n-1, 2.0))

Now that we have the bit_length method available, it might be better to use 
that.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

FWIW, here are two approaches to getting an equi-distributed version of 
int(n*random()) where 0  n = 2**53.  The first mirrors the approach currently 
in the code.  The second approach makes fewer calls to random().

def rnd1(n):
assert 0  n = 2**53
N = 1  (n-1).bit_length()
r = int(N * random())
while r = n:
r = int(N * random())
return r

def rnd2(n, N=153):
assert 0  n = N
NN = N - (N % n) + 0.0  # largest multiple of n = N
r = N * random()# a float with an integral value
while r = NN:
r = N * random()
return int(r) % n

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Either of these looks good to me.

If the last line of the second is changed from return int(r) % n to return 
int(r) // (N // n) then it'll use the high-order bits of random()  instead of 
the low-order bits.  This doesn't matter for MT, but might matter for 
subclasses of Random using a different underlying generator.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

This wouldn't be the first time reproduceability is dropped, since reading from 
the docs:

“As an example of subclassing, the random module provides the WichmannHill 
class that implements an alternative generator in pure Python. The class 
provides a backward compatible way to reproduce results from earlier versions 
of Python, which used the Wichmann-Hill algorithm as the core generator.”

Also:

 FWIW, we spent ten years maintaining the ability to reproduce
 sequences.  It has become an implicit promise.

IMO it should either be documented explicitly, or be taken less dearly. There's 
not much value in an implicit promise that's only known by a select few.

(besides, as Terry said, I think most people are more concerned by the quality 
of the random distribution than by the reproduceability of sequences)

--
nosy: +pitrou

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Senthil Kumaran

Senthil Kumaran orsent...@gmail.com added the comment:

I guess, Antoine wanted to point out this:

Changed in version 2.3: MersenneTwister replaced Wichmann-Hill as the
default generator.

But as the paragraph points out Python did provide non default WichmanHill
class for generating repeatable sequences with older python.

My brief reading on this topic, does suggest that 'repeatability' is
an important requirement for any PRNG.

--
nosy: +orsenthil

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

BTW, the Wichmann-Hill code is gone in py3k, so that doc paragraph needs 
removing or updating.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

Thanks guys, I've got it from here.

Some considerations for the PRNG are:
* equidistribution (for quality)
* repeatability from the same seed (even in multithreaded environments)
* quality and simplicity of API (for usability)
* speed (it matters whether a Monte Carlo simulation takes 5 minutes or 30 
minutes).

I'm looking at several ideas:
* updating the new randrange() to use rnd2() algorithm shown above (for 
equidistribution).
* possibly providing a C version of rnd2() and using it in randrange() for 
speed and for thread-safety.
* possibly updating shuffle() and choice() to use rnd2().
* moving the existing randrange() to randrange_quick() -- uses int(n * random) 
for speed and for reproducibility of previously created sequences.  
Alternatively, adding a recipe to the docs for recreating old sequences and not 
cluttering the code with backwards compatibility cruft.

Am raising the priority to normal because I think some effort needs to be made 
to address equidistribution in 3.2.  Will take some time to come-up with a good 
patch that balances quality, simplicity, speed, thread-safety, and 
reproducibility.   

May also consult with Tim Peters who has previously voiced concerns about 
stripping bits off of multiple calls to random() because the MT proofs make no 
guarantees about quality in those cases.  I don't think this is an issue in 
practice, but in theory when we start tossing out some of the calls to 
random(), we're also throwing away the guarantees of a long periodic, 623 
dimensions, uniformity, and equidistribution.

--
priority: low - normal

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Some considerations for the PRNG are:
 * equidistribution (for quality)
 * repeatability from the same seed (even in multithreaded environments)

I believe a reasonable (com)promise would be to guarantee repeatability
accross a given set of bugfix releases (for example, accross all 2.6.x
releases). We shouldn't necessarily commit to repeatability accross
feature releases, especially if it conflicts with desireable
opportunities for improvement.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

 * possibly providing a C version of rnd2()

If recoding in C is acceptable, I think there may be better ( = simpler and 
faster) ways than doing a direct translation of rnd2.

For example, for small k, the following algorithm for randrange(k) suffices:

 - take a single 32-bit deviate (generated using genrand_int32)
 - multiply by k (a 32-by-32 to 64-bit widening multiply) and return
   the high 32-bits of the result, provided that the bottom half
   of the product is = 2**32 - k (almost always true, for small k).
 - consume extra random words as necessary in the case that the bottom
   half of the product is  2**32 - k.

I can provide code (with that 3rd step fully expanded) if you're interested in 
this approach.

This is likely to be significantly faster than a direct translation of rnd32, 
since in the common case it requires only: one 32-bit deviate from MT, one 
integer multiplication, one subtraction, and one comparison.  By comparison, 
rnd2 uses (at least) two 32-bit deviates and massages them into a float, before 
doing arithmetic with that float.

Though it's possible (even probable) that any speed gain would be insignificant 
in comparison to the rest of the Python machinery involved in a single 
randrange call.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Just to illustrate, here's a patch that adds a method Random._smallrandbelow, 
based on the algorithm I described above.

--
Added file: http://bugs.python.org/file17755/_smallrandbelow.diff

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

Antoine, there does need to be repeatablity; there's no question about that.  
The open question for me is how to offer that repeatability in the cleanest 
manner.

People use random.seed() for reproducible tests.  They need it to have their 
studies become independently validatable, etc.  Some people are pickling the 
state of the RNG and need to restart where they left off, etc.

Mark, thanks for the alternative formulation.  I'll take a look when I get a 
chance.

I've got it from here.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

randint.py: another algorithm to generate a random integer in a range. It uses 
only operations on unsigned integers (no evil floatting point number). It calls 
tick() multiple times to generate enough entropy. It has an uniform 
distribution (if the input generator has an uniform distribution).

tick() is simply the output of the Mersenne Twister generator. The algorithm 
can be optimized (especially the part computing ndigits and scale).

--
Added file: http://bugs.python.org/file17757/randint.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-23 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Antoine, there does need to be repeatablity; there's no question about
 that.

Well, that doesn't address my proposal of making it repeatable accross
bugfix releases only. There doesn't seem to be a strong use case for
perpetual repeatability.

If some people really need perpetual repeatability, I don't understand
how they can rely on Python anyway, since we don't make any such promise
explicitly (or do they somehow manage to read in your mind?). So,
realistically, they should already be using their custom-written
deterministic generators.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-21 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

'Random', without qualification, is commonly taken to mean 'with uniform 
distribution'. Otherwise it has no specific meaning and could well be a synonym 
for 'arbitrary' or 'haphazard'.

The behavior reported is buggy and in my opinion should be fixed if possible. I 
have done simulation research in the past and do not consider them minor. If I 
had results that depended on these functions, I might want to rerun with the 
fixed versions to make sure the end results were not affected. I would 
certainly want the fixed behavior for any future work.

I do not see any promise of reproducibility of sequences from version to 
version. I do not really see the point as one can rerun with the old Python 
version or copy the older random.py.

The old versions could be kept with with an 'old_' prefix and documented in a 
separate subsection that starts with Do not use these buggy old versions of x 
and y in new code. They are only present for those who want to reproduce old 
sequences. But I wonder how many people would use them.

--
nosy: +tjreedy

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

New submission from Mark Dickinson dicki...@gmail.com:

Not a serious bug, but worth noting:

The result of randrange(n) is not even close to uniform for large n.  Witness 
the obvious skew in the following (this takes a minute or two to run, so you 
might want to reduce the range argument):

Python 3.2a0 (py3k:81980, Jun 14 2010, 11:23:36)
[GCC 4.2.1 (SUSE Linux)] on linux2
Type help, copyright, credits or license for more information.
 from random import randrange
 from collections import Counter
 Counter(randrange(6755399441055744) % 3 for _ in range(1))
Counter({1: 37508130, 0: 33323818, 2: 29168052})

(The actual probabilities here are, as you might guess from the above numbers:  
{0: 1/3, 1: 3/8, 2: 7/24}.)

The cause:  for n  2**53, randrange(n) is effectively computed as int(random() 
* n).  For small n, there's a tiny bias involved, but this is still an 
effective method.  However, as n increases towards 2**53, the bias increases 
significantly.  (For n = 2**53, the random module uses a different strategy 
that *does* produce uniformly distributed results.)

A solution would be to lower the cutoff point where randrange() switches from 
using int(random() * n) to using the _randbelow method.

--
components: Library (Lib)
messages: 108095
nosy: mark.dickinson, rhettinger
priority: low
severity: normal
status: open
title: Non-uniformity in randrange for large arguments.
type: behavior
versions: Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Note: the number 6755399441055744 is special:  it's 0.75 * 2**53, and was 
deliberately chosen so that the non-uniformity is easily exhibited by looking 
at residues modulo 3.  For other numbers of this size, the non-uniformity is 
just as bad, but demonstrating the non-uniformity clearly would have taken a 
little more work.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Here's an example patch that removes any bias from randrange(n) (except for 
bias resulting from the imperfectness of the core MT generator).  I added a 
small private method to Modules/_randommodule.c to aid the computation.

This only fixes one instance of int(random() * n) in the Lib/random.py source;  
the other instances should be modified accordingly.

With this patch, randrange is a touch faster than before (20-30% speedup) for 
small arguments.

Is this worth pursuing?

--
keywords: +patch
Added file: http://bugs.python.org/file17706/issue9025.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

The nonuniformity of randrange has a knock-on effect in other random module 
functions.  For example, take a sample of 100 elements from 
range(6004799503160661), and take the smallest element from that sample.  Then 
the exact distribution of that smallest element is somewhat complicated, but 
you'd expect it to be even with probability very close to 50%.  But it turns 
out that it's roughly twice as likely to be even as to be odd.

 from random import sample
 from collections import Counter
 population = range(6004799503160661)
 Counter(min(sample(population, 100)) % 2 for _ in range(10))
Counter({0: 66810, 1: 33190})

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
nosy: +belopolsky

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy: +haypo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Here's a more careful Python-only patch that fixes the bias in randrange and 
randint (but not in shuffle, choice or sample).  It should work well both for 
Mersenne Twister and for subclasses of Random that use a poorer PRNG with 
badly-behaved low-order bits.

--
Added file: http://bugs.python.org/file17707/issue9025_v2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Raymond Hettinger

Changes by Raymond Hettinger rhettin...@users.sourceforge.net:


--
assignee:  - rhettinger

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

Will take a look at this in the next few days.
Am tempted to just either provide a recipe
or provide a new method.  That way sequences generated
by earlier python's are still reproducible.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I would prefer to see correct algorithm in stdlib and a recipe for how to 
reproduce old sequences for the users who care.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

FWIW, we spent ten years maintaining the ability to reproduce sequences.  It 
has become an implicit promise.
I'll take a look at the patch in the next few days.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9025] Non-uniformity in randrange for large arguments.

2010-06-18 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Hmm.  I hadn't considered the reproducibility problem.

Does the module aim for reproducibility across all platforms *and* all versions 
of Python?  Or just one of those?

For small n, I think the patched version of randrange(n) produces the same 
sequence as before with very high probability, but not with certainty.  Since 
that sounds like a recipe for hard-to-find bugs, it might be better to 
deliberately perturb the outputs somehow so that the sequence is obviously 
different from before, rather than subtly different.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9025
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com