G'day Rainer, On Sun, 26 Sep 2010 10:29:08 +0200 Rainer M Krug <r.m.k...@gmail.com> wrote:
> I realized that I am simply interested in the pdf p(y) that y > *number* of entities (which ones is irrelevant) in N are are *not* > drawn after the sampling process has been completed. Even simpler (I > guess), in a first step, I would only need the mean number of > expected non-drawn entities in N (pMean). > > But what about the range in between? You can calculate those probilities. Your problem was sounding vaguely familiar to me yesterday but I could not put my finger onto its abstract counterpart. Now with this clarification I can. :) Your set up is exactly the one of lottery draws. In each draw n of N numbers are drawn without replacement. So your question is "what is the probability that I have not seen k of the N numbers after x draws?". This question can easily be answered by using some Markov Chain theory. Let Y_l be the number of entities that you have not seen after the l-th draw, taking possible values 0, 1, ...., N. Obviously, Y_0=N with probability 1, and Y_1=N-n with probability 1. Now, the probability that Y_l, l>=1, is equal to k is given by; 0 if k=N-n+1, N-n+2,..., N and sum_{i=0,...,n} P[Y_l=k | Y_{l-1} = k+i] P[Y_{l-1} = k+i] o/w. P[Y_l=k | Y_{l-1} = k+i] is the probability that the n entities sampled in the l-th draw consists of i entities of the k+i entities that were not seen by draw l-1 and n-i entities of the N-k-i entities that were already seen by draw l-1. This probability is choose(k+i, i)*choose(N-k-i, n-i) / choose(N, n) Note that this probability is independent of l, i.e., Y_0, Y_1, Y_2,... form a stationary Markov Chain. Thus, to answer your question, the strategy would be to calculate the transition matrix once and then start with either the state vector of Y_0 or of Y_1 (both of which are rather trivial as mentioned above) and calculate the state vector of Y_x by repeatedly multiplying the chosen initial state vector with the transition matrix for the appropriate number of times. The transition matrix is (N+1)x(N+1), so it can be rather large, but the number of non-zero entries in the transition matrix is at most (N+1)*(n+1), so depending on the relative magnitude of n and N, sparse matrix techniques might be helpful (or using a language such as C, FOTRAN, ... for the calculations). HTH. Cheers, Berwin ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.