Re: [HACKERS] Multi-Dimensional Histograms

Robert Haas Mon, 29 Jun 2009 19:23:07 -0700

On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<[email protected]> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<[email protected]> wrote:
>> David Fetter <[email protected]> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues?  Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we can't do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.


I'm finding myself unable to follow all the terminology on this thread.
What's dimension reduction?  What's PCA?

Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:

colA = constA and colB op constB

...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans.  In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA.  It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA.  Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.
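
To make the problem concrete, here is a rough sketch in Python rather
than anything resembling planner code. The table contents, the constants,
and the helper name selectivity() are all made up for illustration. It
assumes the estimate for the combined clause is formed by treating the
two conditions as independent and multiplying their selectivities, and
compares that with the true fraction of matching rows when one value of
colA has a colB distribution unlike the rest of the table.

```python
def selectivity(rows, pred):
    """Fraction of rows for which pred(row) is true."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# Hypothetical table of (colA, colB) pairs: 10 customers, 100 rows each.
# Customer 7 only has large colB values (900..999); every other
# customer only has small ones (0..99).
rows = []
for cust in range(10):
    for i in range(100):
        if cust == 7:
            rows.append((cust, 900 + i))
        else:
            rows.append((cust, i))

const_a, const_b = 7, 500   # WHERE colA = 7 AND colB > 500

# Per-column selectivities, as single-column statistics would give them.
sel_a = selectivity(rows, lambda r: r[0] == const_a)
sel_b = selectivity(rows, lambda r: r[1] > const_b)

# Estimate under the independence assumption: multiply the two.
independent_estimate = sel_a * sel_b

# What the data actually contains.
actual = selectivity(rows, lambda r: r[0] == const_a and r[1] > const_b)

estimated_rows = independent_estimate * len(rows)
actual_rows = actual * len(rows)

print(f"estimated rows: {estimated_rows:.0f}, actual rows: {actual_rows:.0f}")
```

In this toy data set every row with colA = 7 also satisfies colB > 500,
so the independence estimate is low by a factor of ten. Per-value
statistics for colB would fix that for customer 7, but you would need
them for every value of colA that behaves this way.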

...Robert

