On 23-Jul-09 17:59:56, Jim Bouldin wrote: > Dan Nordlund wrote: > "It would be necessary to see the code for your 'brief test' > before anyone could meaningfully comment on your results. > But your results for a single test could have been a valid > "random" result." > > I've re-created what I did below. The problem appears to be > with the weighting process: the unweighted sample came out much > closer to the actual than the weighted sample (>1% error) did. > Comments? > Jim > >> x > [1] 1 2 3 4 5 6 7 8 9 10 11 12 >> weights > [1] 1 1 1 1 1 1 2 2 2 2 2 2 > >> a = mean(replicate(1000000,(sample(x, 3, prob = weights))));a # (1 > million samples from x, of size 3, weighted by "weights"; the mean > should > be 7.50) > [1] 7.406977 >> 7.406977/7.5 > [1] 0.987597 > >> b = mean(replicate(1000000,(sample(x, 3))));b # (1 million samples >> from > x, of size 3, not weighted this time; the mean should be 6.50) > [1] 6.501477 >> 6.501477/6.5 > [1] 1.000227 > > Jim Bouldin, PhD
To obtain the result you expected, you would need to explicitly specify "replace=TRUE", since the default for "replace" is FALSE. (though probably what you really intended was sampling without replacement). Read carefully what is said about "prob" in '?sample' -- when replace=FALSE, the probability of inclusion of an element is not proportional to its weight in 'prob'. The reason is that elements with higher weights are more likely to be chosen early on. This then knocks that higher weight out of the contest, making it more likely that elements with smaller weights will be sampled subsequently. Hence the mean of the sample will be biased slightly downwards, relative to the weighted mean of the values in x. table(replicate(1000000,(sample(x, 3)))) # 1 2 3 4 5 6 # 250235 250743 249603 250561 249828 249777 # 7 8 9 10 11 12 # 249780 250478 249591 249182 249625 250597 (so all nice equal frequencies) table(replicate(1000000,(sample(x, 3,prob=weights)))) # 1 2 3 4 5 6 # 174873 175398 174196 174445 173240 174110 # 7 8 9 10 11 12 # 325820 326140 325289 325098 325475 325916 Note that the frequencies of the values with weight=2 are a bit less than twice the frequencies of the values with weight=1: (325820+326140+325289+325098+325475+325916)/ (174873+175398+174196+174445+173240+174110) # [1] In fact this is fairly easily caluclated. The possible combinations (in order of sampling) of the two weights, with their probabilities, are: 1s 2s ------------------------------------------- 1 1 1 P = 6/18 * 5/17 * 4/16 3 0 1 1 2 P = 6/18 * 5/17 * 12/16 2 1 1 2 1 P = 6/18 * 12/17 * 5/15 2 1 1 2 2 P = 6/18 * 12/17 * 10/15 1 2 2 1 1 P = 12/18 * 6/16 * 5/15 2 1 2 1 2 P = 12/18 * 6/16 * 10/15 1 2 2 2 1 P = 12/18 * 10/16 * 6/14 1 2 2 2 2 P = 12/18 * 10/16 * 8/14 0 3 So the expected number of "weight=1" in the sample is 3*(6/18 * 5/17 * 4/16) + 2*(6/18 * 5/17 * 12/16) + 2*(6/18 * 12/17 * 5/15) + 1*(6/18 * 12/17 * 10/15) + 2*(12/18 * 6/16 * 5/15) + 1*(12/18 * 6/16 * 10/15) + 1*(12/18 * 10/16 * 6/14) + 0 = 1.046218 Hence the expected number of "weight=2" in the sample is 3 - 1.046218 = 1.953782 and their ratio 1.953782/1.046218 = 1.867471 Compare this with the value 1.867351 (above) obtained by simulation! Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 23-Jul-09 Time: 21:05:07 ------------------------------ XFMail ------------------------------ ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.