> On 25 Aug 2017, at 11:23 , Jari Oksanen <jari.oksa...@oulu.fi> wrote: > > It is not about "really arge total number of observations", but: > > set.seed(4711);tabs <- r2dtable(1e6, c(2, 2), c(2, 2)); A11 <- vapply(tabs, > function(x) x[1, 1], numeric(1));table(A11) > > A11 > 0 1 2 > 166483 666853 166664 > > There are three possible matrices, and these come out in proportions 1:4:1, > the one with all cells filled with ones being > most common.
... and > dhyper(0:2,2,2,2) [1] 0.1666667 0.6666667 0.1666667 > dhyper(0:2,2,2,2) *6 [1] 1 4 1 so that is exactly what you would expect. However, > dhyper(10782,209410, 276167, 25000) [1] 0.005230889 so you wouldn't expect 10782 to recur. It is close to the mean of the hypergeometric distribution, but > sim <- rhyper(1e6,209410, 276167, 25000) > mean(sim) [1] 10781.53 > sd(sim) [1] 76.14209 (and incidentally, rhyper is plenty fast enough that you don't really need r2dtable for the 2x2 case) -pd > > Cheers, Jari O. > ________________________________________ > From: R-devel <r-devel-boun...@r-project.org> on behalf of Martin Maechler > <maech...@stat.math.ethz.ch> > Sent: 25 August 2017 11:30 > To: Gustavo Fernandez Bayon > Cc: r-devel@r-project.org > Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly? > >>>>>> Gustavo Fernandez Bayon <gba...@gmail.com> >>>>>> on Thu, 24 Aug 2017 16:42:36 +0200 writes: > >> Hello, >> While doing some enrichment tests using chisq.test() with simulated >> p-values, I noticed some strange behaviour. The computed p-value was >> extremely small, so I decided to dig a little deeper and debug >> chisq.test(). I noticed then that the simulated statistics returned by the >> following call > >> tmp <- .Call(C_chisq_sim, sr, sc, B, E) > >> were all the same, very small numbers. This, at first, seemed strange to >> me. So I decided to do some simulations myself, and started playing around >> with the r2dtable() function. Problem is, using my row and column >> marginals, r2dtable() always returns the same matrix. Let's provide a >> minimal example: > >> rr <- c(209410, 276167) >> cc <- c(25000, 460577) >> ms <- r2dtable(3, rr, cc) > >> I have tested this code in two machines and it always returned the same >> list of length three containing the same matrix three times. The repeated >> matrix is the following: > >> [[1]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > >> [[2]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > >> [[3]] >> [,1] [,2] >> [1,] 10782 198628 >> [2,] 14218 261949 > > Yes. You can also do > > unique(r2dtable(100, rr, cc)) > > and see that the result is constant. > > I'm pretty sure this is still due to some integer overflow, > > in spite of the fact that I had spent quite some time to fix > such problem in Dec 2003, see the 14 years old bug PR#5701 > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2 > > It has to be said that this is based on an algorithm published > in 1981, specifically - from help(r2dtable) - > > Patefield, W. M. (1981) Algorithm AS159. An efficient method of > generating r x c tables with given row and column totals. > _Applied Statistics_ *30*, 91-97. > > For those with JSTOR access (typically via your University), > available at http://www.jstor.org/stable/2346669 > > When I start reading it, indeed the algorithm seems start from the > expected value of a cell entry and then "explore from there"... > and I do wonder if there is not a flaw somewhere in the > algorithm: > > I've now found that a bit more than a year ago, 'paljenczy' found on SO > > https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > that indeed the generated tables seem to be too much around the mean. > Basically his example: > > https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated > > >> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); >> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1)) > user system elapsed > 0.218 0.025 0.244 >> table(A11) > > 34 35 36 37 38 39 40 41 42 43 > 2 17 40 129 334 883 2026 4522 8766 15786 > 44 45 46 47 48 49 50 51 52 53 > 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737 > 54 55 56 57 58 59 60 61 62 63 > 59732 41474 26939 16006 8827 4633 2050 865 340 116 > 64 65 66 67 > 38 13 7 1 >> > > For a 2x2 table, there's really only one degree of freedom, > hence the above characterizes the full distribution for that > case. > > I would have expected to see all possible values in 0:100 > instead of such a "normal like" distribution with carrier only > in [34, 67]. > > There are newer publications and maybe algorithms. > So maybe the algorithm is "flawed by design" for really large > total number of observations, rather than wrong > Seems interesting ... > > Martin Maechler > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd....@cbs.dk Priv: pda...@gmail.com ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel