On Mon, Jun 29, 2009 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I'm finding myself unable to follow all the terminology on this thread.
> What's dimension reduction? What's PCA?
[snip]

Imagine you have a dataset with two variables, say height in inches and
age in years. For the purpose of discussion, let's pretend for a moment
that all the people in your sample have height the same as their age.

You could create a 2d histogram of your data:

 |0000000200000000
 |0000006000000000
a|0000030000000000
g|0000400000000000
e|0003000000000000
 |0010000000000000
 |0100000000000000
 |----------------
       height

You could store this 2d histogram as is and use it for all the things
you'd use histograms for.... or you could make an observation about the
structure, apply a rotation and flattening to the data, and convert it
to a 1d histogram

[0113436200...]

which is far more compact.

Data often has significant correlation, so it's often possible to reduce
the dimensionality without greatly reducing the selectivity of the
histogram. This becomes tremendously important as the number of
dimensions goes up, because the volume of an N-dimensional space grows
exponentially with N, and so does the number of histogram cells needed
to cover it.

PCA is one method of dimensionality reduction. In PCA you find a
linear transformation (scaling, rotation) of the data that aligns it so
that the axes run through the data space along the directions of
greatest variance. The axes with little variance can then be dropped,
which is where the reduction comes from.

I have no clue how you would apply PCA to postgresql histograms, since
to build the PCA transform you need to do some non-trivial operations
with the data. Perhaps PCA could be done on a random sample of a table,
then that transformation could be stored and used to compute the
histograms. I'm sure there has been a lot of research on this.
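
To make the rotate-and-flatten step concrete, here is a rough sketch in Python. It is only an illustration of the idea, not anything PostgreSQL does: the function name principal_axis, the variable names, and the data (read off the 2d histogram, assuming the leftmost column is value 0) are all made up for this example, and it uses the closed-form 2x2 case rather than a general PCA routine.

```python
import math
from collections import Counter

def principal_axis(points):
    """Return (mean, unit vector) for the direction of greatest variance
    of a list of 2-D points. Hypothetical helper, closed-form 2x2 PCA."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector of the covariance matrix
    # [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

# Assumed sample: height == age for everyone, with the per-value counts
# taken from the 2d histogram (value -> number of people).
count_by_value = {1: 1, 2: 1, 3: 3, 4: 4, 5: 3, 6: 6, 7: 2}
points = [(v, v) for v, c in count_by_value.items() for _ in range(c)]

(mx, my), (ux, uy) = principal_axis(points)

# Rotate: coordinate along the principal axis, and the leftover
# coordinate perpendicular to it.
along = [(x - mx) * ux + (y - my) * uy for x, y in points]
across = [-(x - mx) * uy + (y - my) * ux for x, y in points]

# Flatten: the perpendicular coordinate carries no information here,
# so a 1d histogram over the first coordinate is enough.
max_residual = max(abs(r) for r in across)
hist = Counter(round(t, 6) for t in along)
counts = [hist[k] for k in sorted(hist)]

print(counts)        # [1, 1, 3, 4, 3, 6, 2]
print(max_residual)  # essentially zero
```

With real data the perpendicular coordinate would not be exactly zero, so dropping it loses some information. The hope is that it loses little when the columns are strongly correlated.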