Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Duncan, Daniel,

Thanks, and indeed we intend to take the advice that Radford and Lukas have provided in this thread.

I do want to reiterate that the generating system itself cannot have any conception of the use of form IDs as seeds for a PRNG, *and* the system itself only generates a sequence of form IDs, which are then filtered and passed to our API depending on basic rules on user inputs in that form.

Either a truly remarkable probability event has happened in our production system, or the Mersenne-Twister is very susceptible to the first draw in the sequence being correlated across closely related seeds. Both of these require understanding the Mersenne-Twister better.

The solution here, as has been suggested, is to use a different RNG with adequate burn-in (in which case even MT would work), or to look more carefully at our problem and understand whether we just need a hash function.

In either case, we will cease to question R's implementation of Mersenne-Twister (for the time being). :)

T

On Sun, Nov 5, 2017 at 7:47 PM, Duncan Murdoch wrote:

> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>
>> Tirthankar,
>>
>> "random number generators" do not produce random numbers. Any given
>> generator produces a fixed sequence of numbers that appear to meet
>> various tests of randomness. By picking a seed you enter that sequence
>> in a particular place and subsequent numbers in the sequence appear to
>> be unrelated. There are no guarantees that if YOU pick a SET of seeds
>> they won't produce a set of values that are of a similar magnitude.
>>
>> You can likely solve your problem by following Radford Neal's advice of
>> not using the first number from each seed. However, you don't need
>> to use anything more than the second number. So, you can modify your
>> function as follows:
>>
>> function(x) {
>>   set.seed(x, kind = "default")
>>   y = runif(2, 17, 26)
>>   return(y[2])
>> }
>>
>> Hope this is helpful,
>>
>
> That's assuming that the chosen seeds are unrelated to the function
> output, which seems unlikely on the face of it. You can certainly choose a
> set of seeds that give high values on the second draw just as easily as you
> can choose seeds that give high draws on the first draw.
>
> The interesting thing about this problem is that Tirthankar doesn't
> believe that the seed selection process is aware of the function output. I
> would say that it must be, and he should be investigating how that happens
> if he is worried about the output, he shouldn't be worrying about R's RNG.
>
> Duncan Murdoch
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
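The burn-in idea discussed in this message can be sketched roughly as follows. This is only an illustration that generalises Daniel's two-draw function: the helper name `draw_for_id` and the default `burn_in` of 10 are assumptions made for the example, not anything proposed in the thread. As Duncan points out, discarding early draws does not help if the seeds themselves are somehow related to the output.

```r
# Hypothetical helper (name and defaults are illustrative assumptions):
# seed the generator from a form ID, throw away the first `burn_in`
# draws, and return the next one. The result is reproducible per ID.
draw_for_id <- function(id, burn_in = 10L, min = 17, max = 26) {
  # as.integer() makes the coercion of character IDs explicit
  set.seed(as.integer(id), kind = "Mersenne-Twister")
  draws <- runif(burn_in + 1L, min, max)
  draws[burn_in + 1L]
}

# Three of the IDs quoted in the thread, as integers
ids <- c(86548915L, 86551615L, 86566163L)
values <- vapply(ids, draw_for_id, numeric(1))
print(values)
```

With `burn_in = 1L` this reduces to Daniel's function, which returns the second draw. Note that `set.seed()` changes the global RNG state (and here the RNG kind) as a side effect, so a stateless API would call it at the start of every request, as the thread describes.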
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Bill,

Appreciate the point that both you and Serguei are making, but the sequence in question is not a selected or filtered set. These are values as observed in a sequence from a mechanism described below. The probabilities required to generate this exact sequence in the wild seem staggering to me.

T

On Fri, Nov 3, 2017 at 11:27 PM, William Dunlap wrote:

> Any other generator is subject to the same problem with the same
> probability.
>
> > Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99}, 1:1)
> [1]  280  415  826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
> [15] 7212 7364 7747 9256 9491 9568 9886
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
> tirthankar.li...@gmail.com> wrote:
>
>> Bill,
>>
>> I have clarified this on SO, and I will copy that clarification in here:
>>
>> "Sure, we tested them on other 8-digit numbers as well & we could not
>> replicate. However, these are honest-to-goodness numbers generated by a
>> non-adversarial system that has no conception of these numbers being used
>> for anything other than a unique key for an entity -- these are not a
>> specially constructed edge case. Would be good to know what seeds will and
>> will not work, and why."
>>
>> These numbers are generated by an application that serves a form, and
>> associates form IDs in a sequence. The application calls our API depending
>> on the form values entered by users, which in turn calls our R code that
>> executes some code that needs an RNG. Since the API has to be stateless, to
>> be able to replicate the results for possible debugging, we need to draw
>> random numbers in a way that we can replicate the results of the API
>> response -- we use the form ID as seeds.
>>
>> I repeat, there is no design or anything adversarial about the way that
>> these numbers were generated -- the system generating these numbers and
>> the users entering inputs have no conception of our use of an RNG -- this
>> is meant to just be a random sequence of form IDs. This issue was
>> discovered completely by chance when the output of the API was observed to
>> be highly non-random. It is possible that it is a 1/10^8 chance, but that
>> is hard to believe, given that the API hit depends on user input. Note also
>> that the issue goes away when we use a different RNG as mentioned below.
>>
>> T
>>
>> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap wrote:
>>
>>> The random numbers in a stream initialized with one seed should have
>>> about the desired distribution. You don't win by changing the seed all the
>>> time. Your seeds caused the first numbers of a bunch of streams to be
>>> about the same, but the second and subsequent entries in each stream do
>>> look uniformly distributed.
>>>
>>> You didn't say what your 'upstream process' was, but it is easy to come
>>> up with seeds that give about the same first value:
>>>
>>> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:1)
>>> [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>>> tirthankar.li...@gmail.com> wrote:
>>>
>>>> This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455),
>>>> but I now feel that this needs someone from R-Devel to help understand
>>>> why this is happening.
>>>>
>>>> We are facing a weird situation in our code when using R's [`runif`][1]
>>>> and setting seed with `set.seed` with the `kind = NULL` option (which
>>>> resolves, unless I am mistaken, to `kind = "default"`; the default being
>>>> `"Mersenne-Twister"`).
>>>>
>>>> We set the seed using (8 digit) unique IDs generated by an upstream
>>>> system, before calling `runif`:
>>>>
>>>> seeds = c(
>>>> "86548915", "86551615", "86566163", "86577411", "86584144",
>>>> "86584272", "86620568", "86724613", "86756002", "86768593",
>>>> "86772411",
>>>> "8
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Bill,

I have clarified this on SO, and I will copy that clarification in here:

"Sure, we tested them on other 8-digit numbers as well & we could not replicate. However, these are honest-to-goodness numbers generated by a non-adversarial system that has no conception of these numbers being used for anything other than a unique key for an entity -- these are not a specially constructed edge case. Would be good to know what seeds will and will not work, and why."

These numbers are generated by an application that serves a form, and associates form IDs in a sequence. The application calls our API depending on the form values entered by users, which in turn calls our R code that executes some code that needs an RNG. Since the API has to be stateless, to be able to replicate the results for possible debugging, we need to draw random numbers in a way that we can replicate the results of the API response -- we use the form ID as seeds.

I repeat, there is no design or anything adversarial about the way that these numbers were generated -- the system generating these numbers and the users entering inputs have no conception of our use of an RNG -- this is meant to just be a random sequence of form IDs. This issue was discovered completely by chance when the output of the API was observed to be highly non-random. It is possible that it is a 1/10^8 chance, but that is hard to believe, given that the API hit depends on user input. Note also that the issue goes away when we use a different RNG as mentioned below.

T

On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap wrote:

> The random numbers in a stream initialized with one seed should have about
> the desired distribution. You don't win by changing the seed all the
> time. Your seeds caused the first numbers of a bunch of streams to be
> about the same, but the second and subsequent entries in each stream do
> look uniformly distributed.
>
> You didn't say what your 'upstream process' was, but it is easy to come up
> with seeds that give about the same first value:
>
> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:1)
> [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
> tirthankar.li...@gmail.com> wrote:
>
>> This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455),
>> but I now feel that this needs someone from R-Devel to help understand why
>> this is happening.
>>
>> We are facing a weird situation in our code when using R's [`runif`][1]
>> and setting seed with `set.seed` with the `kind = NULL` option (which
>> resolves, unless I am mistaken, to `kind = "default"`; the default being
>> `"Mersenne-Twister"`).
>>
>> We set the seed using (8 digit) unique IDs generated by an upstream
>> system, before calling `runif`:
>>
>>     seeds = c(
>>       "86548915", "86551615", "86566163", "86577411", "86584144",
>>       "86584272", "86620568", "86724613", "86756002", "86768593",
>>       "86772411",
>>       "86781516", "86794389", "86805854", "86814600", "86835092",
>>       "86874179",
>>       "86876466", "86901193", "86987847", "86988080")
>>
>>     random_values = sapply(seeds, function(x) {
>>       set.seed(x)
>>       y = runif(1, 17, 26)
>>       return(y)
>>     })
>>
>> This gives values that are **extremely** bunched together.
>>
>>     > summary(random_values)
>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>       25.13   25.36   25.66   25.58   25.83   25.94
>>
>> This behaviour of `runif` goes away when we use `kind =
>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
>> spread out.
>>
>>     random_values = sapply(seeds, function(x) {
>>       set.seed(x, kind = "Knuth-TAOCP-2002")
>>       y = runif(1, 17, 26)
>>       return(y)
>>     })
>>
>> *Output omitted.*
>>
>> ---
>>
>> **The most interesting thing here is that this does not happen on Windows
>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>> below).
>>
>> # Windows output: #
>>
>>     > seeds = c(
>>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>     +   "86584272", "
Re: [Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Martin,

Thanks for the helpful reply. Alas, I had forgotten that (implied) unfavorable comparisons of *nix systems with Windows systems would likely draw irate (but always substantive) responses on the R-devel list -- poor phrasing on my part. :)

Regardless, let me try to address some of the concerns related to the construction of the MRE itself and see if we can clear away the shrubbery and zero in on the core issue, since I continue to believe that this is an issue with either R's implementation or a bad interaction of the seeds supplied with the Mersenne-Twister algorithm itself. The latter would require a deeper understanding of the algorithm than I have at the moment. If we can rule out the former through this thread, then I will pursue the latter solution path.

Responses inline below, but summarizing:

1. All examples are now run using "R CMD BATCH --vanilla" as you have suggested, to ensure that no other loaded packages or namespace changes have interfered with the behaviour of `set.seed`.
2. Converting the character vector to an integer vector has no impact on the output.
3. Upgrading to the latest version of R has no impact on the output.
4. Multiplying the seed vector by 10L causes the behaviour to vanish, calling into question the large integer theory.

On Fri, Nov 3, 2017 at 3:09 PM, Martin Maechler wrote:

> Why R-devel -- R-help would have been appropriate:
>
> It seems you have not read the help page for
> set.seed as I expect it from posters to R-devel.
> Why would you use strings instead of integers if you *had* read it ?
>

The manual (which we did read) says:

    seed    a single value, interpreted as an integer,

We were confident of R coercing characters to integers correctly. We tested, prior to making this posting, that the behaviour remains intact if we change the `seeds` variable from a character vector to the "equivalent" integer vector by hand.

    > seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
    +   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
    +   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
    +   86901193L, 86987847L, 86988080L)
    >
    > random_values = sapply(seeds, function(x) {
    +   set.seed(x)
    +   y = runif(1, 17, 26)
    +   return(y)
    + })
    >
    > summary(random_values)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      25.13   25.36   25.66   25.58   25.83   25.94

> > We are facing a weird situation in our code when using R's
> > [`runif`][1] and setting seed with `set.seed` with the
> > `kind = NULL` option (which resolves, unless I am
> > mistaken, to `kind = "default"`; the default being
> > `"Mersenne-Twister"`).
>
> again this is not what the help page says; rather
>
> | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
> | ‘set.seed’ selects the currently-used generator (including that
> | used in the previous session if the workspace has been restored):
> | if no generator has been used it selects ‘"default"’.
>
> but as you have > 90 (!!) packages in your sessionInfo() below,
> why should we (or you) know if some of the things you did
> before or (implicitly) during loading all these packages did not
> change the RNG kind ?
>

Agreed. We are running this system in production, and we will need `set.seed` to behave reliably in this session. However, as you say, since we are claiming that there is an issue with the PRNG, we should isolate it in an environment free of the potential confounding factors that come with having 90 packages loaded (did you count?). As mentioned above, we have rerun all examples using "R CMD BATCH --vanilla" and we can report that the output is unchanged.

>
> > We set the seed using (8 digit) unique IDs generated by an
> > upstream system, before calling `runif`:
>
> > seeds = c( "86548915", "86551615", "86566163",
> > "86577411", "86584144", "86584272", "86620568",
> > "86724613", "86756002", "86768593", "86772411",
> > "86781516", "86794389", "86805854", "86814600",
> > "86835092", "86874179", "86876466", "86901193",
> > "86987847", "86988080")
>
> > random_values = sapply(seeds, function(x) {
> >   set.seed(x)
> >   y = runif(1, 17, 26)
> >   return(y)
> > })
>
> Why do you do that?
>
> 1) You should set the seed *once*, not multiple times in one simulation.
>

The code is written this way because the seed is set every time the function (API) is called, for call-level replicability. It doesn't make a lot of sense in an MRE, but this is a critical component of the larger function. We do acknowledge that, for any one of the seeds in the vector `seeds`, the vector of draws appears to have the uniform distribution.

> 2) Assuming that your strings are correctly translated to integers
>    and the same on all platforms, independent of locales (!) etc,
>    you are again not following the simple instruction on the help page:
>
>      ‘set.seed’ uses a single integer argume
[Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455), but I now feel that this needs someone from R-Devel to help understand why this is happening.

We are facing a weird situation in our code when using R's [`runif`][1] and setting seed with `set.seed` with the `kind = NULL` option (which resolves, unless I am mistaken, to `kind = "default"`; the default being `"Mersenne-Twister"`).

We set the seed using (8 digit) unique IDs generated by an upstream system, before calling `runif`:

    seeds = c(
      "86548915", "86551615", "86566163", "86577411", "86584144",
      "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
      "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
      "86876466", "86901193", "86987847", "86988080")

    random_values = sapply(seeds, function(x) {
      set.seed(x)
      y = runif(1, 17, 26)
      return(y)
    })

This gives values that are **extremely** bunched together.

    > summary(random_values)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      25.13   25.36   25.66   25.58   25.83   25.94

This behaviour of `runif` goes away when we use `kind = "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly spread out.

    random_values = sapply(seeds, function(x) {
      set.seed(x, kind = "Knuth-TAOCP-2002")
      y = runif(1, 17, 26)
      return(y)
    })

*Output omitted.*

---

**The most interesting thing here is that this does not happen on Windows -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows below).

# Windows output: #

    > seeds = c(
    +   "86548915", "86551615", "86566163", "86577411", "86584144",
    +   "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
    +   "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
    +   "86876466", "86901193", "86987847", "86988080")
    >
    > random_values = sapply(seeds, function(x) {
    +   set.seed(x)
    +   y = runif(1, 17, 26)
    +   return(y)
    + })
    >
    > summary(random_values)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      17.32   20.14   23.00   22.17   24.07   25.90

Can someone help understand what is going on?
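One way to probe the question above is to compare, for the same seeds, the spread of the first draw with the spread of a later draw. The sketch below is only a diagnostic under stated assumptions: the helper `nth_draw` is a hypothetical name introduced for illustration, the seeds are the ones from the post written as integers, and what the summaries show may depend on the platform and R version, so no output is reproduced here.

```r
# Seeds from the post, as integers rather than strings
seeds <- c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
           86584272L, 86620568L, 86724613L, 86756002L, 86768593L,
           86772411L, 86781516L, 86794389L, 86805854L, 86814600L,
           86835092L, 86874179L, 86876466L, 86901193L, 86987847L,
           86988080L)

# Hypothetical helper: seed the generator, then return the n-th draw
nth_draw <- function(seed, n, kind = "Mersenne-Twister") {
  set.seed(seed, kind = kind)
  runif(n, 17, 26)[n]
}

first_draws  <- vapply(seeds, nth_draw, numeric(1), n = 1L)
second_draws <- vapply(seeds, nth_draw, numeric(1), n = 2L)

# Compare the two spreads; bunching confined to the first draw would
# point at the seeding step rather than at the generator's stream
print(summary(first_draws))
print(summary(second_draws))
```

Passing `kind` explicitly avoids depending on whichever generator the session happens to be using, which matters because `kind = NULL` selects the currently used generator rather than always selecting the default.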
Ubuntu --

    R version 3.4.0 (2017-04-21)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.2 LTS

    Matrix products: default
    BLAS: /usr/lib/libblas/libblas.so.3.6.0
    LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
     [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] RMySQL_0.10.8               DBI_0.6-1
     [3] jsonlite_1.4                tidyjson_0.2.2
     [5] optiRum_0.37.3              lubridate_1.6.0
     [7] httr_1.2.1                  gdata_2.18.0
     [9] XLConnect_0.2-12            XLConnectJars_0.2-12
    [11] data.table_1.10.4           stringr_1.2.0
    [13] readxl_1.0.0                xlsx_0.5.7
    [15] xlsxjars_0.6.1              rJava_0.9-8
    [17] sqldf_0.4-10                RSQLite_1.1-2
    [19] gsubfn_0.6-6                proto_1.0.0
    [21] dplyr_0.5.0                 purrr_0.2.4
    [23] readr_1.1.1                 tidyr_0.6.3
    [25] tibble_1.3.0                tidyverse_1.1.1
    [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
    [29] MLmetrics_1.1.1             caret_6.0-76
    [31] ROCR_1.0-7                  gplots_3.0.1
    [33] effects_3.1-2               pROC_1.10.0
    [35] pscl_1.4.9                  lattice_0.20-35
    [37] MASS_7.3-47                 ggplot2_2.2.1

    loaded via a namespace (and not attached):
     [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0          modelr_0.1.0
     [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0       cellranger_1.1.0
     [9] quantreg_5.33      chron_2.3-50       digest_0.6.10      rvest_0.3.2
    [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10      plyr_1.8.4
    [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2        SparseM_1.77
    [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
    [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12        lazyeval_0.2.0
    [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5       memoise_1.0.0
    [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1         foreign_0.8-69
    [37] tools_3.4.0        hms_0.3            munsell_0.4.3      compiler_3.4.0
    [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0         nloptr_1.0.4
    [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0        gtable_0.2.0
    [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0
    [53] knitr_1.15.1       KernSmo