>> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > > Well, I'm not real familiar with contingency tables, but it seems like > you could end up needing to store a huge amount of data to get any > benefit out of it, in some cases. For example, in the United States, > there are over 40,000 postal codes, and some even larger number of > city names, and doesn't the number of entries go as O(m*n)?
Not to mention that the model parameters will be impossible to estimate well. ( I've only scanned the thread, so sorry if Im repeating something that's already been said ) My intuition is that storing the covariance structure will be unworkable both technically and statistically for all but the simplest of cases. That being said, I dont think the problem is a lack of ways to parameterize the covariance structure ( there are several papers on mutli-dim histogram estimators, at least one of whichtalks about databases explicitly, not to mention approaches like CART[1] ) , but a complete lack of infrastructure to do anything with the estimates. So keep working on it ;-) ( but try to make the framework general enough to allow better estimators ). I wonder if a good first step might be to focus on the AND and OR operators since they seem like the most common cases and union and intersection commute ( although it's important to remember that the estimate variances do NOT ) That is, what about estimating the selectivity of the condition WHERE X=A and Y=B by f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) + x_2*selectivity(X = A)*selectivity( Y = B ) + x_3 where x_{1,2,3} are parameters to be estimated from the data. Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple years back ). Best, Nathan [1] http://en.wikipedia.org/wiki/Classification_and_regression_tree [2] http://en.wikipedia.org/wiki/Copula_%28statistics%29 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers