Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

Andrew Dunstan Tue, 26 Apr 2005 16:10:35 -0700

Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to the notes in analyze.c, which seem to think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not.  Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.

The math in the paper does not seem to look at very low levels of q (= sample to pop ratio).

The formula has a range of [d,N]. It appears intuitively (i.e. I have not done any analysis) that at very low levels of q, as f1 moves down from n, the formula moves down from N towards d very rapidly. I did a test based on the l_comments field in a TPC lineitems table. The test set has N = 6001215, D = 2921877. In my random sample of 1000 I got d = 976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2 orders of magnitude.

I wonder if this paper has anything that might help: http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I were more of a statistician I might be able to answer :-)

cheers

andrew

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
     subscribe-nomail command to [EMAIL PROTECTED] so that your
     message can get through to the mailing list cleanly

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

Reply via email to