Re: [HACKERS] Cross-column statistics revisited

Joshua Tolley Wed, 15 Oct 2008 09:14:01 -0700

On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <[EMAIL PROTECTED]> wrote:
> "Joshua Tolley" <[EMAIL PROTECTED]> writes:
>
>> I've been interested in what it would take to start tracking
>> cross-column statistics. A review of the mailing lists as linked from
>> the TODO item on the subject [1] suggests the following concerns:
>>
>> 1) What information exactly would be tracked?
>> 2) How would it be kept from exploding in size?
>> 3) For which combinations of columns would statistics be kept?
>
> I think then you have
>
> 4) How would we form estimates from these stats


That too, but see below.

>> The major concern in #1 seemed to be that the most suitable form for
>> keeping most common value lists, histograms, etc. is in an array, and
>> at the time of the posts I read, arrays of composite types weren't
>> possible. This seems much less of a concern now -- perhaps in greatest
>> part because a test I just did against a recent 8.4devel sure makes it
>> look like stats on composite type columns aren't even kept. The most
>> straightforward is that we'd keep a simple multi-dimensional
>> histogram, but that leads to a discussion of #2.
>
> "multi-dimensional histogram" isn't such a simple concept, at least not to me.
>
> Histograms aren't a bar chart of equal widths and various heights like I was
> taught in school. They're actually bars of various widths arranged such that
> they are all of the same height.

That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable-width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists. Semantics aside, it looks like pgsql's stats code
(which I realize I'll need to get to know lots better to undertake
something like this) uses a list of quantiles, which still only makes
sense, as I see it, when a distance measurement is available for the
data type. The multidimensional case might do better with a frequency
count, where we'd store a matrix/cube/etc. of uniformly-sized cells,
probably compressed via some regression function. That said, a
multidimensional quantile list would also be possible, compressed
similarly.
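
To make the one-dimensional quantile list concrete, here is a minimal
Python sketch. It is only an illustration, not how pgsql's stats code
actually works; the function names equi_depth_bounds and estimate_lt
are made up for this example. The subtraction inside estimate_lt is
where the distance measurement mentioned above comes in.

```python
def equi_depth_bounds(values, n_buckets):
    """Return n_buckets + 1 boundaries so that each bucket holds
    roughly the same number of rows (a list of quantiles)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // n_buckets] for i in range(n_buckets + 1)]


def estimate_lt(bounds, x):
    """Estimate the fraction of rows with value < x, interpolating
    linearly inside the bucket that contains x."""
    n_buckets = len(bounds) - 1
    if x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    for i in range(n_buckets):
        lo, hi = bounds[i], bounds[i + 1]
        if lo <= x < hi:
            # Interpolation needs a notion of distance for the data type.
            frac = (x - lo) / (hi - lo)
            return (i + frac) / n_buckets
    return 1.0


# Hypothetical sample: the integers 0..100, split into 4 buckets.
bounds = equi_depth_bounds(list(range(101)), 4)
half = estimate_lt(bounds, 50)
```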

Given a frequency matrix or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.
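
As a rough sketch of the frequency-matrix lookup in two dimensions,
something like the following Python would do. Everything here is an
assumption for illustration: the helper names (cell_index, build_grid),
the fixed value ranges, and the sample data are invented, and no
compression is attempted. The comparison against the product of the
per-column fractions shows what is lost when two perfectly correlated
columns are treated as independent.

```python
def cell_index(v, v_min, v_max, n_cells):
    """Map a value to the index of its uniformly-sized cell."""
    i = int((v - v_min) * n_cells / (v_max - v_min))
    return min(max(i, 0), n_cells - 1)


def build_grid(rows, x_range, y_range, n_cells):
    """Build an n_cells x n_cells matrix holding the fraction of rows
    that fall into each cell."""
    grid = [[0] * n_cells for _ in range(n_cells)]
    for x, y in rows:
        i = cell_index(x, x_range[0], x_range[1], n_cells)
        j = cell_index(y, y_range[0], y_range[1], n_cells)
        grid[i][j] += 1
    total = len(rows)
    return [[count / total for count in row] for row in grid]


# Hypothetical sample: two perfectly correlated columns.
rows = [(k, k) for k in range(100)]
grid = build_grid(rows, (0, 100), (0, 100), 4)

# Estimate for "x < 25 AND y < 25": look up the matching cell.
joint = grid[0][0]                            # 0.25

# Same predicate from per-column fractions, assuming independence.
x_frac = sum(grid[0])                         # fraction with x < 25
y_frac = sum(row[0] for row in grid)          # fraction with y < 25
independent = x_frac * y_frac                 # 0.0625
```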

> It's not clear how to extend that concept into two dimensions. I imagine
> there's research on this though. What do the GIST statistics functions store?

No idea.

- Josh / eggyknap

