On Wed, Jul 14, 2010 at 01:21:19PM +0200, Yeb Havinga wrote:
> Heikki Linnakangas wrote:
>> However, the problem is how to represent and store the
>> cross-correlation. For fields with low cardinality, like "gender" and
>> boolean "breast-cancer-or-not" you can count the prevalence of all the
>> different combinations, but that doesn't scale. Another often cited
>> example is zip code + street address. There's clearly a strong
>> correlation between them, but how do you represent that?
>>
>> For scalar values we currently store a histogram. I suppose we could
>> create a 2D histogram for two columns, but that doesn't actually help
>> with the zip code + street address problem.
> In my head the neuron for 'principal component analysis' went on while
> reading this. Back in college it was used to prepare input data before
> feeding it into a neural network. Maybe ideas from PCA could be helpful?

I've been playing off and on with an idea along these lines, which
builds an empirical copula[1] to represent correlations between columns
where there exists a multi-column index containing those columns. This
copula gets stored in pg_statistic. There are plenty of unresolved
questions (and a crash I introduced and haven't had time to track down),
but the code I've been working on is here[2] in the multicolstat branch.
Most of the changes are in analyze.c; no user-visible changes have been
introduced. For that matter, there aren't any changes yet actually to
use the values once calculated (more unresolved questions get in the way
there), but it's a start.

[1] http://en.wikipedia.org/wiki/Copula_(statistics)
[2] http://git.postgresql.org/gitweb?p=users/eggyknap/postgres.git

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
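
For readers unfamiliar with the term, here is a rough sketch of what an
empirical copula over two sampled columns can look like. It is a toy
illustration only and is not the code in the multicolstat branch: the
function names (ranks, empirical_copula_grid) and the choice of a fixed
bins x bins grid are assumptions made for the example. Each column is
replaced by its ranks, and the grid records what fraction of sampled
rows falls into each pair of rank buckets. For independent columns every
cell would be close to 1/bins^2; correlated columns concentrate mass in
a few cells.

```python
from bisect import bisect_right


def ranks(values):
    """Return 1-based ranks; tied values all receive the highest rank of the tie."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) for v in values]


def empirical_copula_grid(xs, ys, bins=4):
    """Return a bins x bins grid of row fractions per (x-rank, y-rank) bucket.

    xs and ys are the sampled values of two columns, row-aligned.
    Hypothetical helper for illustration, not a PostgreSQL API.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    counts = [[0] * bins for _ in range(bins)]
    for rx, ry in zip(ranks(xs), ranks(ys)):
        # Map rank 1..n onto bucket 0..bins-1 using integer arithmetic.
        i = (rx - 1) * bins // n
        j = (ry - 1) * bins // n
        counts[i][j] += 1
    # Convert counts to fractions of the sample.
    return [[c / n for c in row] for row in counts]
```

With xs = [0, 1, ..., 7] and ys = [2 * x for x in xs], all of the mass
lands on the diagonal of the grid, which is the kind of dependence a
pair of per-column histograms cannot express.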