On Sun, 17 Aug 2008, Duncan Murdoch wrote:

Shengqiao Li wrote:
Dear all,

Recently I am generating large random samples (10M) and any duplicated numbers are not desired. We tried several RNGs in R and found Wichmann-Hill did not produce duplications.

The duplication problem is the interesting birthday problem. If there are M possible numbers, randomly draw N numbers from them,
the average number of dupilcations D = N(N-1)/2/M.

For Knuth-TAOCP and Knuth-TAOCP-2002, M=2^30, since this modulus is used. D = 46566.12 for N=10M samples.

For Marsaglia-Multicarry, Super-Duper and Mersene-Twister, M=2^32. D = 11641.53 for N = 10M samples.

My testing results (see below) agree with above analysis. But for Wichmann-Hill, it wasn't. Wichmann-Hill's cycle is 6.9536e12 (refer to RNG help by ?RNG and Whichmann's correction in 1984). Thus M <= 6.9536e12. D
= 7.19052 when N=1e7 and D>= 179.763 for N=5e7.
But in my tests, duplications were surprisingly not observed.

It seems that Wichmann-Hill has a much larger cycle than the one documented!

Anybody can solve this puzzle?


As I told you, look at the code. The cycle length of Wichmann-Hill in R appears to be 2.8e13, and that also appears to be M in your notation. Your birthday problem calculation does not apply here. You could probably get a good approximation to it by rounding the Wichmann-Hill output (e.g. look at round(2^30 * runif()), or maybe more severe rounding).

M is larger than what was previously documented, but Brian Ripley has corrected the documentation.

Wichmann claimed 2.78X10^13 in his 1982 original paper. He made correction in 1984, "The period of the generator was incorrectly given as 2.78 X 10^13. In fact its period is only a quarter of that value: 6.95X10^12." In R version 2.7.1, Brian Ripley adopted Wichmann's 1984 correction in RNG document. That is the currently documented cycle is 6.95X10^12 not 2.78X10^13 nor your approxmation 2.8e13.

So, is the cycle claimed in 1982 correct or the one claimed in 1984?

Even if the larger one 2.78e13 claimed in 1982 is correct, around 45 duplications were expected for 50M samples. But I got 0. My testing method is based on the example code in the last third line in RNG help.

Shengqiao Li


Duncan Murdoch
Regards,

Shengqiao Li

Department of Statistics
West Virgina Unversity

==============Testing===================


RNGkind(kind="Knuth-TAOCP");
RNGkind();

[1] "Knuth-TAOCP" "Inversion"

sum(duplicated(runif(1e7)));

[1] 46216

RNGkind(kind="Knuth-TAOCP-2002");
RNGkind();

[1] "Knuth-TAOCP-2002" "Inversion"

sum(duplicated(runif(1e7)));

[1] 46373


RNGkind(kind="Marsaglia-Multicarry");
RNGkind();

[1] "Marsaglia-Multicarry" "Inversion"

sum(duplicated(runif(1e7)));

[1] 11671

RNGkind(kind="Super-Duper");
RNGkind();

[1] "Super-Duper" "Inversion"

sum(duplicated(runif(1e7)));

[1] 11704

RNGkind(kind="Mersenne-Twister");
RNGkind();

[1] "Mersenne-Twister" "Inversion"

sum(duplicated(runif(1e7)));

[1] 11508

RNGkind(kind="Wichmann-Hill");
RNGkind();

[1] "Wichmann-Hill" "Inversion"

sum(duplicated(runif(1e7)));

[1] 0


gc()

           used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  199157  5.4     646480  17.3  10149018  271.1
Vcells 4497151 34.4  108714629 829.5 169585390 1293.9


sum(duplicated(runif(5e7)))

[1] 0



sessionInfo()

R version 2.7.1 (2008-06-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.7.1

==============End of Testing===================

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to