Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly.  What we need is to minimize the impact when we get
> > it wrong.
>
> Well, I think it's pretty well proven that to be accurate at all you need
> to be able to sample at least 5%, even if some users choose to sample
> less.  Also I don't think anyone on this list disputes that the current
> algorithm is very inaccurate for large tables.  Or do they?
I think it's worse than that.  It's been shown that to get any kind of
accuracy in this kind of estimate you basically have to look at the entire
table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values.  It required scanning the entire table, but it only kept a
sample of the values found, using a method that guaranteed the sample was
representative not of the rows of the table but of the distinct values.

I think you're right that a reasonable sample size for this kind of
estimate is going to be proportional to the table size, not the
constant-sized sample that the regular statistics need.  However, that same
paper has some preliminary discussion of how previous papers had found that
the sample sizes needed were unacceptably large, something like 90%.
Unfortunately I can't find the reference to the paper at the moment; I'll
look again later.

I think it would be interesting to get the sampling algorithm from that
paper in.  It could only be used during a VACUUM ANALYZE, but currently
VACUUMs are necessary anyway.  I do wonder what would happen when we get
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily,
though.  Then it would be less useful.

-- 
greg

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
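[For readers following along: the "scan everything, keep a bounded sample
representative of the distinct values" idea described above can be sketched
roughly as hash-based adaptive distinct sampling.  This is an illustrative
sketch only, not the paper's actual algorithm; the function name and the
`budget` parameter are made up for the example.  The trick is that each
value is kept or dropped based on its hash alone, so every occurrence of a
value makes the same decision, and the sample is a uniform sample of the
*distinct* values rather than of the rows.]

```python
import hashlib

def estimate_distinct(values, budget=64):
    """One full scan; keeps at most ~budget distinct values in memory.

    A value is retained iff its hash is divisible by 2**level, so each
    distinct value survives with probability 2**-level regardless of how
    many times it appears.  When the sample overflows the budget, the
    level is raised (halving the sampling probability) and the sample is
    re-filtered.  The final estimate scales the sample back up.
    """
    def h(v):
        # Deterministic hash, stable across runs (unlike built-in hash()).
        return int(hashlib.md5(str(v).encode()).hexdigest(), 16)

    level = 0
    sample = set()
    for v in values:
        if h(v) % (1 << level) == 0:
            sample.add(v)
            while len(sample) > budget:
                level += 1
                sample = {s for s in sample if h(s) % (1 << level) == 0}
    # Each distinct value was kept with probability 2**-level.
    return len(sample) * (1 << level)
```

With 50 distinct values and a budget of 64, the sample never overflows, so
the estimate is exact; duplicates don't inflate it because repeated values
land in the set only once.  For tables with many more distinct values than
the budget, the estimate is approximate, with error shrinking as the budget
grows.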