On Wed, 6 Feb 2008, Tim Hesterberg wrote:

>> Tim Hesterberg wrote:
>>> I'll raise a related issue - sampling with unequal probabilities,
>>> without replacement.  R does the wrong thing, in my opinion:
>>> ...
>> Peter Dalgaard wrote:
>> But is that the right thing? ...
> (See bottom for more of the previous messages.)
>
>
> First, consider the common case, where size * max(prob) < 1 --
> sampling with unequal probabilities without replacement.
>
> Why do people do sampling with unequal probabilities, without
> replacement?  A typical application would be sampling with probability
> proportional to size, or more generally where the desire is that
> selection probabilities match some criterion.

In real survey PPS sampling it also matters what the pairwise joint
selection probabilities are -- and there are *many* algorithms, with
different properties.  Yves Tillé has written an R package that
implements some of them, and the pps package implements others.

> The default S-PLUS algorithm does that.  The selection probabilities
> at each of step 1, 2, ..., size are all equal to prob, and the overall
> probabilities of selection are size*prob.

Umm, no, they aren't.

S-PLUS 7.0.3 doesn't say explicitly what its algorithm is, but it is
happy to take a sample of size 10 from a population of size 10 with
unequal sampling probabilities.  The overall selection probability
*can't* be anything other than 1 for each element -- sampling without
replacement and proportional to any other set of probabilities is
impossible.

Even in a milder case -- samples of size 5 from 1:10 with probabilities
proportional to 1:10 -- the deviation is noticeable in 1000
replications.  In this case sampling with the specified probabilities
is actually possible, but S-PLUS doesn't do it.

Now, it might be useful to add another replace=FALSE sampler to
sample(), such as the newish Conditional Poisson Sampler based on the
work of S.X. Chen.  This does give correct marginal probabilities of
inclusion, and the pairwise joint probabilities are not too hard to
compute.

I don't think that dropping the current sequential PPS implementation
is a good idea.  The help page does explain the algorithm, though it
might be useful to add an explicit note that the marginal probabilities
of sampling are not the supplied probabilities.

	-thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]               University of Washington, Seattle

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
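
The kind of check described above (samples of size 5 from 1:10 with
probabilities proportional to 1:10, over 1000 replications) can be
sketched in R.  This is an illustration, not code from the original
post.  It runs R's own sample(), whose sequential algorithm is the one
documented on its help page, so it says nothing about what S-PLUS
does.  The variable names and the seed are arbitrary choices, and
`target` is simply size*prob, the inclusion probabilities an exact PPS
design would give.

```r
# Sketch: compare observed inclusion frequencies from sample()
# (replace = FALSE, unequal prob) with the PPS targets size * prob.
set.seed(1)                      # arbitrary seed, for reproducibility only

n_pop  <- 10
size   <- 5
prob   <- (1:10) / sum(1:10)     # probabilities proportional to 1:10
target <- size * prob            # all <= 1 here, so exact PPS is possible
nrep   <- 1000                   # replication count used in the post

# Each column of `draws` is one sample of `size` distinct units.
draws <- replicate(nrep, sample(n_pop, size, replace = FALSE, prob = prob))

# Proportion of replications in which each unit was selected.
observed <- tabulate(as.vector(draws), nbins = n_pop) / nrep

print(round(cbind(unit = 1:n_pop, target = target, observed = observed), 3))

# Degenerate case from the post: size equal to the population size.
# Every unit is selected exactly once, whatever `prob` says.
full <- sample(n_pop, n_pop, replace = FALSE, prob = prob)
stopifnot(setequal(full, 1:n_pop))
```

With sequential selection the heaviest units should come out
under-represented relative to `target` and the lightest ones
over-represented, which is the sense in which the marginal
probabilities are not the supplied ones.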