On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
>
> > The approach I suggested uses the existing technique for selecting
> > random blocks, then either an exhaustive check on all of the rows in a
> > block or the existing random row approach, depending upon available
> > memory. We need to check all of the rows in a reasonable sample of
> > blocks otherwise we might miss clusters of rows in large tables - which
> > is the source of the problems identified.
> >
> > The other reason was to increase the sample size, which is a win in any
> > form of statistics.
>
> Only if your sample is random and independent. The existing mechanism
> tries fairly hard to ensure that every record has an equal chance of
> being selected. If you read the entire block and not appropriate samples
> then you'll introduce systematic sampling errors. For example, if you
> read an entire block you'll be biasing towards smaller records.
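A minimal sketch of the two-stage scheme being discussed, purely for illustration: pick blocks at random, then either read every row in a chosen block or fall back to sampling rows within it when memory is tight. The Python below is a sketch under those assumptions; the function and parameter names are invented here and do not come from the PostgreSQL ANALYZE code.

    import random

    def two_stage_sample(table_blocks, n_blocks, per_block_limit, memory_is_tight):
        # table_blocks: the table as a list of blocks, each block a list of rows.
        # Stage 1: choose blocks uniformly at random.
        chosen = random.sample(table_blocks, n_blocks)
        sample = []
        for rows in chosen:
            if memory_is_tight and len(rows) > per_block_limit:
                # Stage 2a: random rows within the block, as in the
                # existing row-sampling approach.
                sample.extend(random.sample(rows, per_block_limit))
            else:
                # Stage 2b: exhaustive check of every row in the block, so
                # co-located duplicates ("clusters of rows") are not missed.
                sample.extend(rows)
        return sample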
Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is: if there is no clustering, block sampling behaves like random
sampling, which is good; if there is clustering, then you spot it, which
is also good.

> I think it would be useful to have a knob to increase the sample size
> separately from the knob for the amount of data retained in the
> statistics tables. Though I think you'll be disappointed and find you
> have to read an unreasonably large sample out of the table before you
> get more useful distinct estimates.

OK, I'll look at doing that.

Best Regards,

Simon Riggs
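To make the clustering point concrete, here is a small, purely illustrative simulation; the table shape, sample sizes, and the extreme one-value-per-block layout are assumptions made up for this sketch, not figures from the thread. It compares what an equal-sized row-level sample and block-level sample see on a table where duplicate values are perfectly co-located in blocks.

    import random
    from collections import Counter

    # Hypothetical table: 10,000 blocks of 100 rows, each block holding a
    # single value -- an extreme case of duplicates clustered in blocks.
    N_BLOCKS, ROWS_PER_BLOCK = 10_000, 100
    table = [[b] * ROWS_PER_BLOCK for b in range(N_BLOCKS)]
    SAMPLE_ROWS = 3_000

    # Row-level sample: every row has an equal, independent chance.
    all_rows = [v for block in table for v in block]
    row_sample = random.sample(all_rows, SAMPLE_ROWS)

    # Block-level sample of the same total size: whole blocks, read exhaustively.
    chosen_blocks = random.sample(table, SAMPLE_ROWS // ROWS_PER_BLOCK)
    block_sample = [v for block in chosen_blocks for v in block]

    for name, sample in (("row sample", row_sample), ("block sample", block_sample)):
        counts = Counter(sample)
        singletons = sum(1 for c in counts.values() if c == 1)
        print(f"{name}: {len(counts)} distinct values seen, {singletons} seen only once")

On a typical run, the row-level sample sees most of its values only once, while the block-level sample sees every sampled value about 100 times, exposing the within-block duplication that would otherwise be missed.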