On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
> Greg Stark wrote
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
> > samples only work for things like the histograms that can be analyzed
> > through standard statistics population sampling which depends on the law of
> > large numbers.

ISTM Greg's comments are correct. There is no way to calculate this with
consistent accuracy when using a constant sized sample. (If there were, then
people wouldn't bother to hold large databases...)

> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in analyze.c, which seem to
> think that we're *overestimating* n_distinct.

The only information you can determine from a sample is the smallest
number of distinct values that would reasonably produce the given
sample. There is no meaningful concept of a median one... (You do have
an upper bound: the number of rows in the table, but I cannot see any
meaning in taking (Nrows+estimatedN_distinct)/2).

Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters for the distribution.

Most other RDBMS make optimizer statistics collection an unsampled
scan. Some offer this as one of their options, as well as the ability to
define the sample size in terms of a fixed number of rows or a fixed
proportion of the table.

My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately. Then
anybody who doesn't like the estimated statistics has a clear path to
take.

The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
comparisons to ensure an upper bound of join selectivity is known. If
the statistics derived are themselves inaccurate, the error propagation
touches every other calculation in the optimizer. GIGO.

What price a single scan of a table, however large, when incorrect
statistics could force scans and sorts to occur when they aren't
actually needed?
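
To make the sampling point concrete, here is a minimal sketch in Python. It is
not the ANALYZE code and says nothing about the estimator in analyze.c; the
table size, the sample size, and the assumption that every value occurs the
same number of times are all made up for illustration. It draws a fixed-size
sample of rows from three hypothetical tables whose true n_distinct differs by
up to a factor of ten, and counts the distinct values seen in each sample.

```python
import random


def distinct_in_sample(n_rows, rows_per_value, sample_size, seed=0):
    """Count distinct values in a fixed-size random sample of rows.

    Hypothetical table layout: row i holds value i // rows_per_value,
    so every value occurs exactly rows_per_value times.
    """
    rng = random.Random(seed)
    # Sample row positions without replacement.
    sampled_rows = rng.sample(range(n_rows), sample_size)
    return len({row // rows_per_value for row in sampled_rows})


N_ROWS = 1_000_000    # assumed table size
SAMPLE_SIZE = 3_000   # assumed constant sample size

results = {}
for rows_per_value in (1, 2, 10):
    true_n_distinct = N_ROWS // rows_per_value
    seen = distinct_in_sample(N_ROWS, rows_per_value, SAMPLE_SIZE)
    results[rows_per_value] = (true_n_distinct, seen)
    print(f"true n_distinct={true_n_distinct:>9,}  distinct in sample={seen:,}")
```

Under these assumptions the three samples should look almost the same (nearly
every sampled row carries a value not seen elsewhere in the sample), even
though the true n_distinct ranges from 100,000 to 1,000,000. Whatever formula
is applied afterwards has very little to work with, which is the argument for
offering an unsampled scan.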
Best Regards, Simon Riggs