On 13 March 2018 at 08:39, John Naylor <jcnay...@gmail.com> wrote: >> Also, this is common enough that in fact that distribution >> can be reasonably approximated by a normal distribution. > > For the archives, because it's typically seen 10 times in the sample, > per the rule of thumb mention upthread? >
Actually, I think that the rule of thumb of at least 10 instances in the sample for a normal approximation is probably too strict. Looking around, values as low as 5 seem to be quite commonly used. Of course there is no right answer here, it depends on what you're doing with it, and how close you want the normal distribution to be to the hypergeometric one. For our purposes, I don't think we really need it to be that close. We just want to be able to justify statements like "the value is *probably* more common than X, and it was unlikely that that was just random sampling error". >> Making use of >> the non-MCV-based estimate allows us to do better -- if the MCV-based >> estimate is statistically significantly higher than the non-MCV-based >> estimate, then it would almost certainly be better to include that >> value in the MCV list, even if its spread is quite large. > > Because we can treat the sample as normal, testing for > 2 standard > deviations means that we're "~95% sure" it's more common in the table > than the non-MCV selectivity would suggest, right? (I realize I'm not > using technically accurate language) > Actually, it's more like 97.5% (remember the normal distribution is symmetric, so there's around 2.5% below the 2-stddev interval, around 95% in it, and another 2.5% above it), but that's just nit-picking. There are several nice online calculators for playing around with hypergeometric distributions, such as http://keisan.casio.com/exec/system/1180573201 Consider this distribution: N = 1000000 n = 10000 K = 500 Mean = n * K / N = 5 Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9 Stddev = sqrt(Variance) = 2.2 Using the calculator above, you can see that the distribution is fairly normal-like, but with a noticeable positive skew. The 2-stddev interval is 0.6 to 9.4, and according to the calculator the probability of the value being less than or equal to 1 is in the ballpark of the 2.5% figure expected. So even with just 5 occurrences in the sample, it's fairly close to a normal distribution. Regards, Dean