Hi Peter, hi Berwin,

thanks a lot for your clarifications, it makes more sense now. But having
your input and thinking a little more about the problem, I realized that
I am simply interested in the pdf p(y) that y *number* of entities (which
ones is irrelevant) in N are *not* drawn after the sampling process has
been completed. Even simpler (I guess), as a first step I would only need
the expected number of non-drawn entities in N (pMean).
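Just to make pMean concrete, here is a small sketch of what I mean (the
function name is mine; it assumes the x samples are independent, so a given
entity escapes one sample of n with probability (N-n)/N):

```r
# Expected number of entities never drawn after x independent samples
# of size n (each without replacement) from N entities.
# P(a given entity escapes one sample of n) = (N - n) / N
pmean_nondrawn <- function(N, n, x) {
  N * ((N - n) / N)^x
}

pmean_nondrawn(N = 1000, n = 20, x = 100)
```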

The ranges of my values:
N is in the range of 1 to 100 000
x is in the range of 10 to 40 000 000
n is around 20

I guess in cases where n*x is substantially smaller than N, I could simply
use a binomial distribution for n*x samples to approximate it -- right?
For cases where n*x is substantially bigger than N, I can safely (especially
in the context of my simulation) assume that all entities in N are drawn at
least once.
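To spell out what I mean by that approximation (a sketch; function names are
mine): treating all n*x sub-samples as independent draws that each hit a
given entity with probability 1/N gives a Binomial(n*x, 1/N) count per
entity:

```r
# Binomial approximation for n*x << N: the number of times a given
# entity is drawn is approximately Binomial(n*x, 1/N), so
# P(entity never drawn) ~= (1 - 1/N)^(n*x).
p_never_binom <- function(N, n, x) dbinom(0, size = n * x, prob = 1 / N)

# Exact probability under the actual scheme (x independent samples,
# each of size n without replacement), for comparison:
p_never_exact <- function(N, n, x) ((N - n) / N)^x

N <- 100000; n <- 20; x <- 1000
c(binom = p_never_binom(N, n, x), exact = p_never_exact(N, n, x))
```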

But what about the range in between?
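For that range I could of course brute-force it; this is the kind of
simulation sketch I have in mind (parameters here are illustrative only):

```r
# Simulate the full scheme: x samples of size n without replacement
# from N entities, all n items replaced between samples; count how
# many entities were never drawn.
sim_nondrawn <- function(N, n, x, reps = 1000) {
  replicate(reps, {
    drawn <- logical(N)
    for (i in seq_len(x)) drawn[sample.int(N, n)] <- TRUE
    sum(!drawn)
  })
}

set.seed(1)
y <- sim_nondrawn(N = 500, n = 20, x = 50)
mean(y)                       # simulated mean of non-drawn entities
500 * ((500 - 20) / 500)^50   # theoretical mean for comparison
```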

Thanks again,

Cheers,

Rainer

On Sat, Sep 25, 2010 at 5:19 PM, Peter Dalgaard <pda...@gmail.com> wrote:

> On 09/25/2010 04:24 PM, Rainer M Krug wrote:
> > Hi
> >
> > This is OT, but I need it for my simulation in R.
> >
> > I have a special case for sampling with replacement: instead of sampling
> > once and replacing it immediately, I sample n times, and then replace all
> n
> > items.
> >
> >
> > So:
> >
> > N entities
> > x samples with replacement
> > each sample consists of n sub-samples WITHOUT replacement, which are all
> > replaced before the next sample is drawn
> >
> > My question is: which distribution can I use to describe how often each
> > entity of the N has been sampled?
> >
> > Thanks for your help,
> >
> > Rainer
> >
>
> How did you know I was in the middle of preparing lectures on the
> variance of the hypergeometric distribution and such? ;-)
>
> If you look at a single item, the answer is of course that you have a
> binomial with size=x and prob=n/N. The problem is that these binomials
> are correlated between items.
>
> If you can make do with a 2nd order approximation, then the covariance
> between the indicators for two items being selected is easily found from
> the symmetry and the fact that if you sum all N indicators you get the
> constant n. I.e. the variance is p(1-p) and the covariance is
> -p(1-p)/(N-1). For sums over repeated samples, just multiply everything
> by the number x of samples.
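(If I follow this correctly, in R the second-order summary would be
something like the sketch below; the numbers are just an example:)

```r
# Second-order summary per the description above: marginally, each
# entity's draw count is Binomial(size = x, prob = n/N); counts for
# two different entities are negatively correlated.
N <- 1000; n <- 20; x <- 100
p <- n / N
mean_count <- x * p                        # expected draws per entity
var_count  <- x * p * (1 - p)              # variance of one count
cov_counts <- -x * p * (1 - p) / (N - 1)   # covariance of two counts
c(mean = mean_count, var = var_count, cov = cov_counts)
```

(Sanity check: the total of all N counts is the constant n*x, and indeed
N*var_count + N*(N-1)*cov_counts = 0.)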
>
> If you intend to just count the frequency of a particular feature in
> each of your n-samples, i.e., you have x replications of a
> hypergeometric experiment, then you can do somewhat better by computing
> the explicit convolution of x hypergeometrics (convolve(x, rev(y),
> type="o") and Reduce() are your friends). I'm not sure this is actually
> worth the trouble, but it should be doable for decent-sized N and x.
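(And for the convolution suggestion, I take it you mean something along
these lines; small illustrative numbers, and the feature is assumed to be
carried by m of the N entities:)

```r
# Explicit convolution of x hypergeometric pmfs via convolve() and
# Reduce(): distribution of the total count, over x samples of size n,
# of a feature carried by m of the N entities.
N <- 50; m <- 10; n <- 5; x <- 4
pmf1 <- dhyper(0:n, m, N - m, n)      # pmf for one sample of size n
conv <- function(a, b) convolve(a, rev(b), type = "open")
pmf_total <- Reduce(conv, replicate(x, pmf1, simplify = FALSE))
sum(pmf_total)                        # sums to 1 (up to FFT noise)
sum((0:(n * x)) * pmf_total)          # mean, equals x * n * m / N
```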
>
>
>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Phone: (+45)38153501
> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>



-- 
NEW GERMAN FAX NUMBER!!!

Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Natural Sciences Building
Office Suite 2039
Stellenbosch University
Main Campus, Merriman Avenue
Stellenbosch
South Africa

Cell:           +27 - (0)83 9479 042
Fax:            +27 - (0)86 516 2782
Fax:            +49 - (0)321 2125 2244
email:          rai...@krugs.de

Skype:          RMkrug
Google:         r.m.k...@gmail.com


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
