New submission from Dennis Sweeney <sweeney.dennis...@gmail.com>:
Problem: random.choices() never returns sequence entries at large odd indices. For example:

    >>> import random
    >>> [x % 2 for x in random.choices(range(2**53), k=20)]
    [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
    >>> [x % 2 for x in random.choices(range(2**54), k=20)]
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This might require a 64-bit build.

Possible solutions:

1) Do nothing. Document the quirk, and recommend using
   [seq[randrange(len(seq))] for i in range(k)] instead.

2) Use integer arithmetic rather than random(). Using _randbelow() in the
   unweighted case alone breaks test_random.test_choices_algorithms(), but
   adding a case for `if isinstance(total, int)` and using _randbelow() there
   as well creates a difference in reproducibility between weights=[1]*n and
   weights=[1.0]*n.

3) Raise an exception or warning when loss of precision is likely to occur,
   something like `if n > 2.0 ** BPF: raise OverflowError(...)`. All smaller
   integers are exactly representable, but that doesn't quite eliminate the
   bias (see below).

-----------------------

There is bias in random.choices() even when n <= 2**53:

    >>> Counter(val % 3 for val in
    ...         choices(range(2**53 - 2**51), k=1_000_000))
    Counter({1: 374976, 0: 333613, 2: 291411})

    >>> Counter(val % 3 for val in
    ...         choices(range(2**53 - 2**51), k=1_000_000)
    ...         if val < 2**51)
    Counter({0: 166669, 1: 83664, 2: 83293})

For some explanation, imagine floats have only 3 bits of precision. Then
random.choices() would do something like this on a sequence of length 7:

    # Random 3-significant-bit floats, drawn from the pool of
    # 2**BPF equally spaced values k/2**BPF in [0, 1)
    def random():
        return randrange(8) / 8

    >>> Counter(floor(random() * 7) for _ in range(1_000_000))
    Counter({0: 249566, 5: 125388, 6: 125251, 2: 125202,
             1: 125149, 3: 124994, 4: 124450})

Issue: both 0/8 and 1/8 fall in [0, 1/7), so index 0 is hit by two of the
eight possible random() values and is selected about twice as often.
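To make the parity failure above concrete: CPython's random() returns m / 2**53 for an integer 0 <= m < 2**53, and the unweighted path of choices() computes indices as floor(random() * n), so for n == 2**54 every index comes out even. A minimal sketch of that mechanism (the helper name index_like_choices is mine, for illustration only, not a stdlib API):

```python
import random
from math import floor

def index_like_choices(n):
    # Mimics the index computation random.choices() uses in the
    # unweighted case: floor(random() * n).
    return floor(random.random() * n)

# random() == m / 2**53 for some integer m, and multiplying by a power
# of two is exact in binary floating point, so
# floor(random() * 2**54) == 2*m -- always even.
assert all(index_like_choices(2**54) % 2 == 0 for _ in range(10_000))

# The randrange-based workaround from option 1 uses integer arithmetic
# (_randbelow) and has no such parity bias.
parities = {random.randrange(2**54) % 2 for _ in range(10_000)}
assert parities == {0, 1}
```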
Similar biases:

    >>> Counter(floor(randrange(16)/16*15) for _ in range(1_000_000))
    Counter({0: 124549, 5: 62812, 14: 62807, 6: 62766, 10: 62740,
             3: 62716, 12: 62658, 13: 62649, 9: 62473, 8: 62376,
             2: 62373, 4: 62346, 11: 62293, 1: 62278, 7: 62164})

    >>> Counter(floor(randrange(16)/16*14) for _ in range(1_000_000))
    Counter({7: 125505, 0: 124962, 11: 62767, 9: 62728, 6: 62692,
             10: 62684, 4: 62625, 3: 62465, 12: 62428, 13: 62397,
             2: 62332, 8: 62212, 5: 62176, 1: 62027})

    >>> def bias(bits, n):
    ...     c = Counter(floor(randrange(2**bits)/(2**bits) * n)
    ...                 for _ in range(1_000_000))
    ...     m = min(c.values())
    ...     preferred = [k for k, v in c.items() if v / m > 1.5]
    ...     preferred.sort()
    ...     return preferred

    >>> bias(bits=4, n=15)
    [0]
    >>> bias(bits=4, n=14)
    [0, 7]
    >>> bias(bits=8, n=250)
    [0, 41, 83, 125, 166, 208]

    # Observation of which numbers may be biased toward;
    # I have no proof that this is universally true.
    >>> [250 * i // (2**8 - 250) for i in range(6)]
    [0, 41, 83, 125, 166, 208]

If using the OverflowError approach, I think using a threshold of 2**BPF/2
would only make some "bins" (indices) have 1.5x the probability of others
rather than 2x, and thus would not remove the problem either.

----------
components: Library (Lib)
messages: 393285
nosy: Dennis Sweeney, rhettinger
priority: normal
severity: normal
status: open
title: Bias in random.choices(long_sequence)
type: behavior
versions: Python 3.10, Python 3.11

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue44080>
_______________________________________