Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly.  What we need is to minimize the impact when we get
> > it wrong.
>
> Well, I think it's pretty well proven that to be accurate at all you need
> to be able to sample at least 5%, even if some users choose to sample
> less.  Also I don't think anyone on this list disputes that the current
> algorithm is very inaccurate for large tables.  Or do they?
I think it's worse than that.  It's been shown that to get any kind of
accuracy in this kind of estimate you basically have to look at the entire
table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values.  It required scanning the entire table, but it only kept a
sample of the values found, using a method that guaranteed the sample was
representative not of the rows of the table but of the distinct values.

I think you're right that a reasonable sample size for this kind of
estimate is going to be proportional to the table size, not the
constant-sized sample that the regular statistics need.  However, that same
paper has some preliminary discussion of how previous papers had found that
the sample sizes needed were unacceptably large, something like 90%.
Unfortunately I can't find the reference to the paper at the moment; I'll
look again later.

I think it would be interesting to get the sampling algorithm from that
paper in.  It could only be used during a VACUUM ANALYZE, but currently
VACUUMs are necessary anyway.  I do wonder what would happen when we get
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily,
though.  Then it would be less useful.

-- 
greg

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
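[For readers following along: the "scan everything, keep a bounded sample
representative of the distinct values" idea described above can be sketched
roughly as hash-based adaptive distinct sampling.  This is an illustrative
sketch only, not the paper's actual algorithm; the function name and the
`budget` parameter are made up for the example.  The trick is that each
value is kept or dropped based on its hash alone, so every occurrence of a
value makes the same decision, and the sample is a uniform sample of the
*distinct* values rather than of the rows.]

```python
import hashlib

def estimate_distinct(values, budget=64):
    """One full scan; keeps at most ~budget distinct values in memory.

    A value is retained iff its hash is divisible by 2**level, so each
    distinct value survives with probability 2**-level regardless of how
    many times it appears.  When the sample overflows the budget, the
    level is raised (halving the sampling probability) and the sample is
    re-filtered.  The final estimate scales the sample back up.
    """
    def h(v):
        # Deterministic hash, stable across runs (unlike built-in hash()).
        return int(hashlib.md5(str(v).encode()).hexdigest(), 16)

    level = 0
    sample = set()
    for v in values:
        if h(v) % (1 << level) == 0:
            sample.add(v)
            while len(sample) > budget:
                level += 1
                sample = {s for s in sample if h(s) % (1 << level) == 0}
    # Each distinct value was kept with probability 2**-level.
    return len(sample) * (1 << level)
```

With 50 distinct values and a budget of 64, the sample never overflows, so
the estimate is exact; duplicates don't inflate it because repeated values
land in the set only once.  For tables with many more distinct values than
the budget, the estimate is approximate, with error shrinking as the budget
grows.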