On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <[EMAIL PROTECTED]> wrote: > "Joshua Tolley" <[EMAIL PROTECTED]> writes: > >> I've been interested in what it would take to start tracking >> cross-column statistics. A review of the mailing lists as linked from >> the TODO item on the subject [1] suggests the following concerns: >> >> 1) What information exactly would be tracked? >> 2) How would it be kept from exploding in size? >> 3) For which combinations of columns would statistics be kept? > > I think then you have > > 4) How would we form estimates from these stats
That too, but see below. >> The major concern in #1 seemed to be that the most suitable form for >> keeping most common value lists, histograms, etc. is in an array, and >> at the time of the posts I read, arrays of composite types weren't >> possible. This seems much less of a concern now -- perhaps in greatest >> part because a test I just did against a recent 8.4devel sure makes it >> look like stats on composite type columns aren't even kept. The most >> straightforward is that we'd keep a simple multi-dimensional >> histogram, but that leads to a discussion of #2. > > "multi-dimensional histogram" isn't such a simple concept, at least not to me. > > Histograms aren't a bar chart of equal widths and various heights like I was > taught in school. They're actually bars of various widths arranged such that > they all of the same heights. That depends on whose definition you've got in mind. Wikipedia's version (for whatever that's worth) defines it as variable width columns of not-necessarily-uniform heights, where the area of each bar is the frequency. The default behavior of the hist() function in the R statistics package is uniform bar widths, which generates complaints from purists. Semantics aside, it looks like pgsql's stats stuff (which I realize I'll need to get to know lots better to undertake something like this) uses a list of quantiles, which still only makes sense, as I see it, when a distance measurement is available for the data type. The multidimensional case might do better with a frequency count, where we'd store a matrix/cube/etc. with uniformly-sized units, probably compressed via some regression function. That said, a multidimensional quantile list would also be possible, compressed similarly. Provided a frequency chart or quantile set, it seems estimates can be made just as they are in one dimension, by finding the right value in the quantile or frequency matrix. > It's not clear how to extend that concept into two dimensions. I imagine > there's research on this though. What do the GIST statistics functions store? No idea. - Josh / eggyknap -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers