Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
I'll point out that there is a large literature on generating pseudo-random numbers for parallel processes, and it is not as easy as one (at least me) would intuitively think. By contrapositive-like reasoning, one might guess that it will not be easy to pick seeds in a way that will produce independent sequences.

(I'm a bit confused about the objective, but) if the objective is to produce independent sequences from several different seeds, then the RNGs for parallel processing might be a good place to start. (And, BTW, if you want to reproduce random numbers generated in parallel, you need to keep track of both the starting seed and the number of nodes.)

Paul Gilbert

On 11/05/2017 10:58 AM, peter dalgaard wrote:
> On 5 Nov 2017, at 15:17 , Duncan Murdoch wrote:
>> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>>> Tirthankar,
>>>
>>> "random number generators" do not produce random numbers. Any given generator produces a fixed sequence of numbers that appear to meet various tests of randomness. By picking a seed you enter that sequence in a particular place and subsequent numbers in the sequence appear to be unrelated. There are no guarantees that if YOU pick a SET of seeds they won't produce a set of values that are of a similar magnitude.
>>>
>>> You can likely solve your problem by following Radford Neal's advice of not using the first number from each seed. However, you don't need to use anything more than the second number. So, you can modify your function as follows:
>>>
>>> function(x) {
>>>   set.seed(x, kind = "default")
>>>   y = runif(2, 17, 26)
>>>   return(y[2])
>>> }
>>>
>>> Hope this is helpful,
>>
>> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it. You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>>
>> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output. I would say that it must be, and he should be investigating how that happens if he is worried about the output; he shouldn't be worrying about R's RNG.
>
> Hmm, no. The basic issue is that RNGs are constructed so that with x_{n+1} = f(x_n), x_1, x_2, x_3, ... will look random, not so that f(s_1), f(s_2), f(s_3), ... will look random for any s_1, s_2, ... . This is true even if the seeds s_1, s_2, ... are not chosen so as to mess with the RNG. In the present case, it seems that the seeds around 86e6 tend to give similar output. On the other hand, it is not _just_ the similarity in magnitude that does it; try e.g.
>
>   s <- as.integer(runif(100, 86.54e6, 86.98e6))
>   r <- sapply(s, function(s){set.seed(s); runif(1,17,26)})
>   plot(s,r, pch=".")
>
> and no obvious pattern emerges. My best guess is that the seeds are not only of similar magnitude, but also have other bit-pattern similarities. (Isn't there a Knuth quote to the effect that "Every random number generator will fail in at least one application"?)
>
> One remaining issue is whether it is really true that the same seeds give different output on different platforms. That shouldn't happen, I believe.
>
>> Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
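Paul Gilbert's pointer to RNGs designed for parallel processing could be sketched roughly as follows. This assumes R's built-in "L'Ecuyer-CMRG" generator and `nextRNGStream()` from the base `parallel` package are acceptable here; the master seed 2017, the stream count of 4, and the helper `draw_from_stream()` are illustrative choices, not anything proposed in the thread.

```r
library(parallel)

# Assumption: one master seed plus the stream index identifies each stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(2017)

n_streams <- 4L
stream_seeds <- vector("list", n_streams)
s <- get(".Random.seed", envir = globalenv())
for (i in seq_len(n_streams)) {
  stream_seeds[[i]] <- s
  s <- nextRNGStream(s)   # jump ahead to the start of the next stream
}

# Hypothetical helper: install a stream's state, then draw from it.
draw_from_stream <- function(seed_vec, n) {
  assign(".Random.seed", seed_vec, envir = globalenv())
  runif(n, 17, 26)
}

draws <- lapply(stream_seeds, draw_from_stream, n = 3)
```

Reproducing a given draw then requires the master seed and the stream index, which is in line with the remark about tracking both the starting seed and the number of nodes.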
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
> On 5 Nov 2017, at 15:17 , Duncan Murdoch wrote:
>
> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>> Tirthankar,
>>
>> "random number generators" do not produce random numbers. Any given generator produces a fixed sequence of numbers that appear to meet various tests of randomness. By picking a seed you enter that sequence in a particular place and subsequent numbers in the sequence appear to be unrelated. There are no guarantees that if YOU pick a SET of seeds they won't produce a set of values that are of a similar magnitude.
>>
>> You can likely solve your problem by following Radford Neal's advice of not using the first number from each seed. However, you don't need to use anything more than the second number. So, you can modify your function as follows:
>>
>> function(x) {
>>   set.seed(x, kind = "default")
>>   y = runif(2, 17, 26)
>>   return(y[2])
>> }
>>
>> Hope this is helpful,
>
> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it. You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>
> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output. I would say that it must be, and he should be investigating how that happens if he is worried about the output; he shouldn't be worrying about R's RNG.
>

Hmm, no. The basic issue is that RNGs are constructed so that with x_{n+1} = f(x_n), x_1, x_2, x_3, ... will look random, not so that f(s_1), f(s_2), f(s_3), ... will look random for any s_1, s_2, ... . This is true even if the seeds s_1, s_2, ... are not chosen so as to mess with the RNG. In the present case, it seems that the seeds around 86e6 tend to give similar output. On the other hand, it is not _just_ the similarity in magnitude that does it; try e.g.

  s <- as.integer(runif(100, 86.54e6, 86.98e6))
  r <- sapply(s, function(s){set.seed(s); runif(1,17,26)})
  plot(s,r, pch=".")

and no obvious pattern emerges. My best guess is that the seeds are not only of similar magnitude, but also have other bit-pattern similarities. (Isn't there a Knuth quote to the effect that "Every random number generator will fail in at least one application"?)

One remaining issue is whether it is really true that the same seeds give different output on different platforms. That shouldn't happen, I believe.

> Duncan Murdoch

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com
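The distinction Peter Dalgaard draws, between consecutive draws x_1, x_2, ... from one seed and first draws f(s_1), f(s_2), ... across many seeds, can be looked at directly. The sketch below is only a diagnostic under stated assumptions: it uses the 21 form IDs from the original post as seeds, names the generator explicitly as "Mersenne-Twister", and the variable names are made up for illustration. No particular output is claimed here; the original poster reported the first draws as tightly bunched.

```r
# The 21 seeds reported in the original post.
seeds <- c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
           86584272L, 86620568L, 86724613L, 86756002L, 86768593L,
           86772411L, 86781516L, 86794389L, 86805854L, 86814600L,
           86835092L, 86874179L, 86876466L, 86901193L, 86987847L,
           86988080L)

# f(s_1), f(s_2), ...: re-seed every time and keep only the first draw.
first_draws <- vapply(seeds, function(s) {
  set.seed(s, kind = "Mersenne-Twister")
  runif(1, 17, 26)
}, numeric(1))

# x_1, x_2, ...: seed once and take consecutive draws from that stream.
set.seed(seeds[1], kind = "Mersenne-Twister")
one_stream <- runif(length(seeds), 17, 26)

# Compare the spread of the two sets of values.
print(summary(first_draws))
print(summary(one_stream))
```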
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Duncan, Daniel,

Thanks, and indeed we intend to take the advice that Radford and Lukas have provided in this thread.

I do want to reiterate that the generating system itself cannot have any conception of the use of form IDs as seeds for a PRNG, *and* the system itself only generates a sequence of form IDs, which are then filtered and passed to our API depending on basic rules applied to the user inputs in that form.

Either a truly remarkable low-probability event has happened in our production system, or the Mersenne-Twister is very susceptible to its first draw being correlated across closely related seeds. Both of these require understanding the Mersenne-Twister better.

The solution here, as has been suggested, is to use a different RNG with adequate burn-in (in which case even MT would work), or to look more carefully at our problem and understand whether we just need a hash function.

In either case, we will cease to question R's implementation of Mersenne-Twister (for the time being). :)

T

On Sun, Nov 5, 2017 at 7:47 PM, Duncan Murdoch wrote:
> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>
>> Tirthankar,
>>
>> "random number generators" do not produce random numbers. Any given generator produces a fixed sequence of numbers that appear to meet various tests of randomness. By picking a seed you enter that sequence in a particular place and subsequent numbers in the sequence appear to be unrelated. There are no guarantees that if YOU pick a SET of seeds they won't produce a set of values that are of a similar magnitude.
>>
>> You can likely solve your problem by following Radford Neal's advice of not using the first number from each seed. However, you don't need to use anything more than the second number. So, you can modify your function as follows:
>>
>> function(x) {
>>   set.seed(x, kind = "default")
>>   y = runif(2, 17, 26)
>>   return(y[2])
>> }
>>
>> Hope this is helpful,
>>
>
> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it. You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>
> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output. I would say that it must be, and he should be investigating how that happens if he is worried about the output; he shouldn't be worrying about R's RNG.
>
> Duncan Murdoch
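The burn-in option mentioned in this reply might look something like the following. It is a minimal sketch rather than the poster's actual code: the function name `draw_with_burn_in()` and the burn-in length of 100 are assumptions chosen for illustration, and the thread does not say how many draws count as adequate.

```r
# Hypothetical helper: seed from a form ID, discard the first `burn_in`
# draws, then return one value on [17, 26].
draw_with_burn_in <- function(id, burn_in = 100L) {
  set.seed(as.integer(id), kind = "Mersenne-Twister")
  runif(burn_in)            # burn-in draws, thrown away
  runif(1, 17, 26)
}

# The same ID always yields the same value, so the API stays replicable.
v1 <- draw_with_burn_in("86548915")
v2 <- draw_with_burn_in("86548915")
```

Because the result depends only on the ID and the burn-in length, the stateless, replicable behaviour described earlier in the thread is preserved as long as `burn_in` is kept fixed.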
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
> Tirthankar,
>
> "random number generators" do not produce random numbers. Any given generator produces a fixed sequence of numbers that appear to meet various tests of randomness. By picking a seed you enter that sequence in a particular place and subsequent numbers in the sequence appear to be unrelated. There are no guarantees that if YOU pick a SET of seeds they won't produce a set of values that are of a similar magnitude.
>
> You can likely solve your problem by following Radford Neal's advice of not using the first number from each seed. However, you don't need to use anything more than the second number. So, you can modify your function as follows:
>
> function(x) {
>   set.seed(x, kind = "default")
>   y = runif(2, 17, 26)
>   return(y[2])
> }
>
> Hope this is helpful,

That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it. You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.

The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output. I would say that it must be, and he should be investigating how that happens if he is worried about the output; he shouldn't be worrying about R's RNG.

Duncan Murdoch
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Tirthankar,

"random number generators" do not produce random numbers. Any given generator produces a fixed sequence of numbers that appear to meet various tests of randomness. By picking a seed you enter that sequence in a particular place and subsequent numbers in the sequence appear to be unrelated. There are no guarantees that if YOU pick a SET of seeds they won't produce a set of values that are of a similar magnitude.

You can likely solve your problem by following Radford Neal's advice of not using the first number from each seed. However, you don't need to use anything more than the second number. So, you can modify your function as follows:

function(x) {
  set.seed(x, kind = "default")
  y = runif(2, 17, 26)
  return(y[2])
}

Hope this is helpful,

Dan

--
Daniel Nordlund
Port Townsend, WA USA

On 11/3/2017 11:30 AM, Tirthankar Chakravarty wrote:
> Bill,
>
> Appreciate the point that both you and Serguei are making, but the sequence in question is not a selected or filtered set. These are values as observed in a sequence from a mechanism described below. The probabilities required to generate this exact sequence in the wild seem staggering to me.
>
> T
>
> On Fri, Nov 3, 2017 at 11:27 PM, William Dunlap wrote:
>> Another generator is subject to the same problem with the same probability.
>>
>> Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99}, 1:10000)
>> [1] 280 415 826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043 7212 7364 7747 9256 9491 9568 9886
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <tirthankar.li...@gmail.com> wrote:
>>> Bill,
>>>
>>> I have clarified this on SO, and I will copy that clarification in here:
>>>
>>> "Sure, we tested them on other 8-digit numbers as well & we could not replicate. However, these are honest-to-goodness numbers generated by a non-adversarial system that has no conception of these numbers being used for anything other than a unique key for an entity -- these are not a specially constructed edge case. Would be good to know what seeds will and will not work, and why."
>>>
>>> These numbers are generated by an application that serves a form and assigns form IDs in a sequence. The application calls our API depending on the form values entered by users, which in turn calls our R code, some of which needs an RNG. Since the API has to be stateless, to be able to replicate the results for possible debugging, we need to draw random numbers in a way that lets us replicate the results of the API response -- we use the form IDs as seeds.
>>>
>>> I repeat, there is no design or anything adversarial about the way that these numbers were generated -- the system generating these numbers and the users entering inputs have no conception of our use of an RNG -- this is meant to just be a random sequence of form IDs. This issue was discovered completely by chance when the output of the API was observed to be highly non-random. It is possible that it is a 1/10^8 chance, but that is hard to believe, given that the API hit depends on user input. Note also that the issue goes away when we use a different RNG as mentioned below.
>>>
>>> T
>>>
>>> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap wrote:
>>>> The random numbers in a stream initialized with one seed should have about the desired distribution. You don't win by changing the seed all the time. Your seeds caused the first numbers of a bunch of streams to be about the same, but the second and subsequent entries in each stream do look uniformly distributed.
>>>>
>>>> You didn't say what your 'upstream process' was, but it is easy to come up with seeds that give about the same first value:
>>>>
>>>> Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>>>> [1] 514 532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com
>>>>
>>>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <tirthankar.li...@gmail.com> wrote:
>>>>> This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455), but I now feel that this needs someone from R-Devel to help understand why this is happening.
>>>>>
>>>>> We are facing a weird situation in our code when using R's `runif` and setting the seed with `set.seed` with the `kind = NULL` option (which resolves, unless I am mistaken, to `kind = "default"`; the default being `"Mersenne-Twister"`).
>>>>>
>>>>> We set the seed using (8-digit) unique IDs generated by an upstream system, before calling `runif`:
>>>>>
>>>>> seeds = c(
>>>>>   "86548915", "86551615", "86566163", "86577411", "86584144",
>>>>>   "86584272", "86620568", "86724613", "86756002", "86768593",
>>>>>   "86772411", "86781516", "86794389", "86805854", "86814600",
>>>>>   "86835092", "86874179", "86876466", "86901193", "86987847",
>>>>>   "86988080")
>>>>>
>>>>> random_values = sapply(seeds, function(x) {
>>>>>   set.seed(x)
>>>>>   y = runif(1, 17, 26)
>>>>>   return(y)
>>>>> })
>>>>>
>>>>> This gives values that are **extremely** bunched together.