> On Mon, Feb 07, 2005 at 11:27:59 -0500, [EMAIL PROTECTED] wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of a study, especially when the diversity of the subject is unknown. It
>> is known that reducing a sample size increases the probability of error
>> in any poll or study. The required sample size depends on the variance
>> of the whole. It is mathematically unsound to ASSUME any sample size is
>> valid without understanding the standard deviation of the set.
>
> For large populations, the accuracy of estimates of statistics based on
> random samples from that population is not very sensitive to population
> size and depends primarily on the sample size. So you would not expect
> to need larger sample sizes on larger data sets, for data sets over some
> minimum size.
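The quoted claim can be checked numerically. This is a minimal sketch (the proportion `p = 0.3` and the helper name are my own illustrative choices, not from the thread): for a fixed sample size, the standard error of a sample proportion barely changes between a 10,000-row and a 10,000,000-row population once the finite population correction is applied.

```python
import math

def standard_error(p, n, N):
    """SE of a sample proportion of size n from a population of size N,
    with the finite population correction sqrt((N - n) / (N - 1))."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

p, n = 0.3, 100  # hypothetical true proportion and a 100-row sample
se_small = standard_error(p, n, 10_000)
se_large = standard_error(p, n, 10_000_000)
print(f"SE at N=10k:  {se_small:.5f}")
print(f"SE at N=10M:  {se_large:.5f}")
```

Both values come out near 0.046; the population size changes the result by well under one percent, which is the sense in which accuracy "depends primarily on the sample size".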
That assumes a fairly low standard deviation. If the standard deviation is low, then a minimal sample size works fine; if there were zero deviation in the data, a sample of one would suffice. If the standard deviation is high, then you need more samples. And if you have a high standard deviation and a large data set, you need more samples than you would for a smaller data set.

In the current implementation of analyze.c, the default is 100 samples. On a table of 10,000 rows, that is probably a good number to characterize the data well enough for the query optimizer (a 1% sample). For a table with 4.6 million rows, it is less than 0.002%.

Think about an irregularly occurring event, unevenly distributed throughout the data set. A randomized sampling strategy spread across the whole data set with too few samples will mischaracterize the event or even miss it altogether.
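The "miss it altogether" risk is easy to quantify. A back-of-the-envelope sketch using the numbers from the post (4.6 million rows, 100 samples); the 1,000-row event size is my own illustrative assumption:

```python
# Probability that a uniform random 100-row sample contains none of the
# rows belonging to a rare, clustered event.
N = 4_600_000        # table size from the post
n = 100              # default sample count in analyze.c
event_rows = 1_000   # hypothetical size of the rare event (assumption)

# Chance each sampled row misses the event, compounded over n draws
# (sampling-with-replacement approximation; nearly exact at this scale).
p_miss = (1 - event_rows / N) ** n
print(f"chance the sample misses the event entirely: {p_miss:.1%}")
```

Under these assumptions the sample misses the event entirely about 98% of the time, so the optimizer's statistics would usually record it as nonexistent.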