Hi Stephan, Two thoughts: First, on the technical side, the numbers you're getting are correct. They match what you'd get in R, which uses its own math library. Specifically:
> ppois(5, 4, lower.tail=FALSE) [1] 0.2148696 > ppois(56, 47, lower.tail=FALSE) [1] 0.08594085 > ppois(560, 475, lower.tail=FALSE) [1] 6.651191e-05 > ppois(5600, 4755, lower.tail=FALSE) [1] 4.390957e-33 > ppois(56007, 47550, lower.tail=FALSE) [1] 1.468864e-311 > ppois(560072, 475507, lower.tail=FALSE) [1] 0 Here's another check, e.g. for the second line: https://www.wolframalpha.com/input/?i=GammaRegularized%5B57,+0,+47%5D *As far as GSL is concerned, this is all working as expected.* So far, so good. Which brings me to the second point: Why does this not match your intuition? If I understand correctly, you assume that being the same relative distance away from the mean should yield the same probability. E.g. getting 560 instead of expected 475 successes implies a parameter value for p that overshoots the known value by 560/475*p - p = 4.6%. Similarly, getting 5600 instead of 4755 successes overshoots the true parameter p by 4.6%. With the overshoot being constant, why do the probabilities of exceeding by the observed count or more get smaller? (In fact, they get exponentially smaller, and I'm using that term advisedly.) You're using a Poisson approximation to a binomial distribution with large *n*. To remove one source of doubt: the fact that you're using an approximation (you could have used a Normal approximation instead, as John did in his reply) is not a problem. But for clarity, let's plug in the binomial distribution, which forces us to spell out what *n* is: > pbinom(5, 18, p=0.256903, lower.tail=FALSE) [1] 0.3066459 > pbinom(56, 185, p=0.256903, lower.tail=FALSE) [1] 0.06753162 > pbinom(560, 1850, p=0.256903, lower.tail=FALSE) [1] 4.129011e-06 > pbinom(5600, 18509, p=0.256903, lower.tail=FALSE) [1] 1.137108e-44 > pbinom(56007, 185092, p=0.256903, lower.tail=FALSE) [1] 0 But the binomial model is the sum of successes in *n* independent and identically distributed (iid) Bernoulli trials. In your model, the probability of success (getting the letter C) is known and fixed; and each time you draw a letter according to that model, you do so with the same fixed probability, independent of any previous draws. If that model is correct (and in a way you could say you're refuting the null hypothesis that it is correct), you have more of a chance to overshoot the expected value when the number of independent draws is small. You may just get (un)lucky. But as the number of draws increases, you can't continue to be consistently (un)lucky. Most of the values have to be closer to the mean, and that is in fact a function of *n*. Compare and contrast that with your intuition, which said it's not a function of *n*, only a function of the relative excess. In what way does the probability of overshooting the mean depend on the number of draws *n*? One simple answer that provides an upper bound comes from Hoeffding's_inequality <https://en.wikipedia.org/wiki/Hoeffding%27s_inequality>, which states that the sum of bounded iid random variables must be concentrated around the expected value, with exponential tails on both sides. If you use the form of Hoeffding's equality for exceeding the mean of *n* iid Bernoulli variables, you find Pr[H(n) >= (p+ε) n] <= exp(-2 ε^2 n) where H(n) is a random variable counting the number of successes in *n* iid Bernoulli trials, p is your success probability, and we can let ε = k/N-p, i.e. ε is the probability overshoot of around 4.6% that we had seen earlier. Then we find: Pr[H(18) >= 5.44664] <= 0.927607 Pr[H(185) >= 55.9794] <= 0.46193 Pr[H(1850) >= 559.794] <= 0.000442345 Pr[H(18509) >= 5600.66] <= 2.76243e-34 Pr[H(185092) >= 56007.2] <= 0 So the upper bound for the probability of exceeding the mean by a fixed relative amount in *n* iid Bernoulli trials is a function exp(-C n) for a fixed constant C>0, i.e. the tail probability decreases exponentially for increasing *n*. In overly simple terms: In independent coin flips, you can't be consistently more lucky than the true success probability. Everything else being fixed, your luck will run out exponentially the longer you keep going. On Thu, Mar 7, 2019 at 3:04 PM Stephan Lorenzen <stephan.loren...@gmx.de> wrote: > Dear list, > > I have N=1850923 drawings of characters which can be either A,T,C or G. > The probability of C is p=0.256903, so I would expect > lambda=N*p=1850923*0.256903=475508 C's. > Indeed, I got k=560073. > According to my understanding, the probability to get 560073 or more C's > in 1850923 drawings should be gsl_cdf_poisson_Q(k-1, > lambda)=gsl_cdf_poisson_Q(560072, 475508) > I get slightly but not dramatically more than expected, so I would > expect a p-value of, whatwever, 0.2 or so. > However, I got 0. > I would expect the p-value in the same range than when getting 5 instead > of 4 drawings, so I did a series: > > gsl_cdf_poisson_Q( 5, 4) = 0.21487 > gsl_cdf_poisson_Q( 56, 47) = 0.0859409 > gsl_cdf_poisson_Q( 560, 475) = 6.65119e-05 > gsl_cdf_poisson_Q( 5600, 4755) = 4.39096e-33 > gsl_cdf_poisson_Q( 56007, 47550) = 1.46886e-311 > gsl_cdf_poisson_Q(560072, 475507) = 0 > > The results starting from the second line are certainly not what I expect. > Probably I have a fundamental misunderstanding of gsl_cdf_poisson_Q? > What would be the proper way to use it, or which function should I use > instead? > > Thanks for any answers. > > P.S. the actual problem concerns longer sequences with way lower > p-values where the Poisson distribution is more appropriate. I just > picked a simple example with a single letter and thus high p-value > >