Josh Berkus <josh@agliodbs.com> writes: > Greg, Tom, > > > a) We already use block based sampling to reduce overhead. If you're > > talking about using the entire block and not just randomly sampled > > tuples from within those blocks then your sample will be biased. > > There are actually some really good equations to work with this, estimating > both row correlation and n-distinct based on sampling complete blocks. > See prior discussions on this list on N-distinct.
In the prior discussions someone posted the paper with the algorithm I mentioned. That paper mentions that previous work showed poor results at estimating n_distinct even with sample sizes as large as 50% or more. > > 1) You have n^2 possible two-column combinations. That's a lot of > > processing and storage. > > Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. You have O(n^2) possible *two-column* combinations and O(n!) n-column combinations. I was hoping two-column combinations would be enough information to extrapolate from for larger combinations. > > 2) It isn't even clear what data you're exactly looking for. Certainly > > "correlation" is just shorthand here and isn't what you're actually > > looking for. > > Actually, I'd think that a correlation number estimate (0 = complete > uncorrelated, 1 = completely correlated) would be sufficient to improve > row count estimation significantly, without incurring the vast overhead of > histogramXhistorgram manipulation. No, correlation is actually quite uninteresting here. Consider for example two columns "state" and "area code". The "area code" is pretty much 100% dependent on state, but will show very little correlation. Similarly things like "product name" and "catalog number" or even "author" and "genre" would be problem cases but have little correlation. And you can easily imagine queries that specify restrictive clauses on two such columns for which the existing statistics will overestimate the selectivity because it assumes no interdependency. Hopefully you're right that you don't need complete histograms. Perhaps there's some statistics concept they don't teach in stats 101 that would cover this need. What we're looking for is a function f(a,b,x) that gives the net selectivity given a and b, the selectivity of two clauses based on two columns, and x some simple value that can be easily calculated by looking at the data in advance. -- greg ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend