Hello all, I think my last post on this topic may have been too convoluted, so here's another go. The goal here is to try and derive some well-defined strategies for designing ranges that make use of random number generators.
The purpose of this first email is to justify what I think should be the first basic rule of random ranges [that is, ranges where popFront() makes use of random or pseudo-random numbers to generate the new value of front()]. That rule is: ************************************************************************** * Reading multiple times from the start of the same random range, should * * produce different (and statistically independent) results each time. * ************************************************************************** To see why, let's start by considering a simple input range that provides an infinite stream of random numbers in the range [min, max). struct SimpleRandomRange(RNG = void) if(isUniformRNG!RNG || is(RNG == void)) { private real _min, _max, _u; static if (!is(RNG == void)) { private RNG _gen; this(real min, real max, RNG gen) { _min = min; _max = max; _gen = gen; _u = uniform(_min, _max, _gen); } } else { this(real min, real max) { _min = min; _max = max; _u = uniform(_min, _max); } } immutable bool empty = false; real front() @property pure nothrow const { return _u; } void popFront() { static if (is(RNG == void)) _u = uniform(_min, _max); else _u = uniform(_min, _max, _gen); } } auto simpleRandomRange()(real min, real max) { return SimpleRandomRange!void(min, max); } auto simpleRandomRange(RNG)(real min, real max, RNG gen) if (isUniformRNG!RNG) { return SimpleRandomRange!RNG(min, max, gen); } Let's consider the basics of how this works. If we don't specify a generator, auto r1 = simpleRandomRange(0.0, 1.0); ... then the call to uniform() will make use of the thread-local default random number generator, rndGen. If we do specify an RNG: Random gen; gen.seed(unpredictableSeed); auto r2 = simpleRandomRange(0.0, 1.0, gen); ... then SimpleRandomRange will store a copy of it to use when generating the random numbers. So, what happens when we try and read numbers from this? Let's try and generate some output from this range: auto r1 = simpleRandomRange(0.0L, 1.0L); auto take1 = r1.take(5); auto take2 = r1.take(5); writeln(take1); writeln(take1); writeln(take2); writeln(take2); ... which might give us for example: [0.816207, 0.40483, 0.638285, 0.00737022, 0.467244] [0.816207, 0.902301, 0.0961124, 0.611573, 0.579579] [0.816207, 0.788726, 0.614613, 0.613375, 0.136722] [0.816207, 0.0942176, 0.894744, 0.751959, 0.77816] We notice immediately that these sequences are all different, except for the very first number, which is always identical. What's more, it's identical for each of the two 'takes'. What's going on here? Well, the first value is being set in the constructor; subsequent values are being set by calls to the global RNG rndGen, and because its state is being updated, that affects all subsequent values generated. So, it means that (1) Different 'takes' from the same range will be different; (2) Reading twice from the same 'take' will also generate a different sequence. That in itself is not so bad -- but the always-the-same first value is a problem, as it will bias any statistics we generate. What happens instead if we specify a generator? Random gen; gen.seed(unpredictableSeed); auto r2 = simpleRandomRange(0.0L, 1.0L, gen); auto take3 = r2.take(5); writeln(take3); writeln(take3); ... then we may get for example: [0.833161, 0.11342, 0.872063, 0.661881, 0.039937] [0.833161, 0.11342, 0.872063, 0.661881, 0.039937] i.e., two identical sequences. Superficially this looks nice -- a random sequence that is consistent within the same 'take'. But now let's take another 5 numbers: auto take4 = r2.take(5); writeln(take4); writeln(take4); ... and we get again: [0.833161, 0.11342, 0.872063, 0.661881, 0.039937] [0.833161, 0.11342, 0.872063, 0.661881, 0.039937] Why is this the same? When we pass gen to the constructor of SimpleRandomRange, it copies it by value -- meaning that updates to the internally-stored _gen don't propagate back to the source RNG. This is extremely undesirable, because it means that we can generate very strong correlations between what ought to be independent random sequences. How can we resolve this dilemma? One thought would be to ensure that any RNG stored inside SimpleRandomRange is seeded unpredictably. Manually, without changing SimpleRandomRange, we can do this by making sure that we do something like: auto r3 = simpleRandomRange(0.0L, 1.0L, Random(unpredictableSeed)); auto take5 = r3.take(5); writeln(take5); writeln(take5); auto take6 = r3.take(5); writeln(take6); writeln(take6); ... but in fact this again produces identical output all round. The other alternative is for random number generators to be redefined as reference types. A simple experiment is to redefine MersenneTwisterEngine as a final class rather than a struct [N.B. this is probably _not_ the optimal way to turn RNGs into reference types, but is quick and easy to try out!]: auto gen2 = new MtClass19937(unpredictableSeed); auto r5 = simpleRandomRange(0.0L, 1.0L, gen2); auto take9 = r5.take(5); auto take10 = r5.take(5); writeln(take9); writeln(take9); writeln(take10); writeln(take10); ... which gives us output similar to the original example using rndGen: [0.844254, 0.136284, 0.0805684, 0.351376, 0.830413] [0.844254, 0.863786, 0.983373, 0.389967, 0.257907] [0.844254, 0.825822, 0.176731, 0.397709, 0.470349] [0.844254, 0.349027, 0.118591, 0.707107, 0.893357] i.e., an identical first value, but otherwise different. One might object that the previous example -- the one where simpleRandomRange was passed Random(unpredictableSeed) -- offers a superior outcome. After all, it generates a reproducible pseudo-random sequence that (theoretically) shouldn't generate correlations with other (pseudo-)random activity in the program. The fact that two different .take commands generate identical output is a feature, not a problem -- each time you're reading from the start of the range, so you'd expect to get the same values! To this I'd offer three responses. The first is that enforcing an unpredictable seed for the random range (in effect, making it operate like an independent thread in terms of its internal RNG) makes things difficult to debug as, for debugging purposes, you want to be able to have predictable seeds for streams of pseudo-random numbers. The second is that it seems to me that having a multiplicity of separate RNGs risks statistical reliability: it's difficult to be certain that two different "unpredictable" seeds won't in fact produce correlated sequences, and becomes more difficult if you have many different separately seeded RNGs. Finally, note that even if you could set aside these issues, the reproducibility of the sequence rests on the fact that your RNG is a pseudo-random number generator. If instead you pass SimpleRandomRange an RNG that reads from a true source of randomness (say, /dev/random) it will be impossible to generate predictable sequences. Hence the rule which I posited at the beginning of this email, and which I'll restate here, which seems to me the only way to ensure consistent behaviour of these entities regardless of the source of randomness used: ************************************************************************** * Reading multiple times from the start of the same random range, should * * produce different (and statistically independent) results each time. * ************************************************************************** As a corollary, this means that pseudo-random number generators (in Phobos and elsewhere) should be redefined as reference types, which has already been pointed out in D bugzilla: http://d.puremagic.com/issues/show_bug.cgi?id=7067 However, that's not the end of the story. Remember that first value from the range being always the same? This also needs dealing with, and (I think) is non-trivial. I will discuss this in the next email. In the meantime -- I'm interested in feedback on the proposed rule. :-) Thanks and best wishes, -- Joe