> On Mon, Feb 07, 2005 at 11:27:59 -0500, [EMAIL PROTECTED] wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of a study, especially when the diversity of the subject is unknown. It
>> is known that reducing a sample size increases the probability of error
>> in any poll or study. The required sample size depends on the variance
>> of the whole. It is mathematically unsound to ASSUME any sample size is
>> valid without understanding the standard deviation of the set.
>
> For large populations, the accuracy of estimates of statistics based on
> random samples from that population is not very sensitive to population
> size and depends primarily on the sample size. So you would not expect
> to need larger sample sizes on larger data sets, for data sets over some
> minimum size.
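The quoted claim can be checked numerically. This is a minimal sketch (the proportion `p = 0.3` and the helper name are my own illustrative choices, not from the thread): for a fixed sample size, the standard error of a sample proportion barely changes between a 10,000-row and a 10,000,000-row population once the finite population correction is applied.

```python
import math

def standard_error(p, n, N):
    """SE of a sample proportion of size n from a population of size N,
    with the finite population correction sqrt((N - n) / (N - 1))."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

p, n = 0.3, 100  # hypothetical true proportion and a 100-row sample
se_small = standard_error(p, n, 10_000)
se_large = standard_error(p, n, 10_000_000)
print(f"SE at N=10k:  {se_small:.5f}")
print(f"SE at N=10M:  {se_large:.5f}")
```

Both values come out near 0.046; the population size changes the result by well under one percent, which is the sense in which accuracy "depends primarily on the sample size".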
That assumes a fairly low standard deviation. If the standard deviation is low, then a minimal sample size works fine; if there were zero deviation in the data, a sample of one would suffice. If the standard deviation is high, then you need more samples. And if you have a high standard deviation and a large data set, you need more samples than you would for a smaller data set.

In the current implementation of analyze.c, the default is 100 samples. On a table of 10,000 rows, that is probably a good number to characterize the data well enough for the query optimizer (a 1% sample). For a table with 4.6 million rows, it is less than 0.002%.

Think about an irregularly occurring event, unevenly distributed throughout the data set. A randomized sampling strategy spread across the whole data set with too few samples will mischaracterize the event or even miss it altogether.
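The "miss it altogether" risk is easy to quantify. A back-of-the-envelope sketch using the numbers from the post (4.6 million rows, 100 samples); the 1,000-row event size is my own illustrative assumption:

```python
# Probability that a uniform random 100-row sample contains none of the
# rows belonging to a rare, clustered event.
N = 4_600_000        # table size from the post
n = 100              # default sample count in analyze.c
event_rows = 1_000   # hypothetical size of the rare event (assumption)

# Chance each sampled row misses the event, compounded over n draws
# (sampling-with-replacement approximation; nearly exact at this scale).
p_miss = (1 - event_rows / N) ** n
print(f"chance the sample misses the event entirely: {p_miss:.1%}")
```

Under these assumptions the sample misses the event entirely about 98% of the time, so the optimizer's statistics would usually record it as nonexistent.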