Re: [HACKERS] multivariate statistics v14

Tomas Vondra Tue, 22 Mar 2016 05:38:32 -0700

Hi,

On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:

Hum. So without 0006 or beyond, there's not much benefit for the
PostgreSQL users, and you are not too confident about 0006 or
beyond. Then I would think it is a little bit hard to justify in
putting 000[2-5] into 9.6. I really like this feature and would
like to see in PostgreSQL someday, but I'm not sure if we should
put the patches (0002-0005) into PostgreSQL now. Please let me
know if there's some reasons we should put the patches into
PostgreSQL now.


I don't think so. While being able to combine multiple statistics
is certainly useful, I'm convinced that the initial patches add
enough


Can you please elaborate a little bit more how combining multiple
statistics is useful?


Sure.

The goal of multivariate statistics is to approximate a probability
distribution on a group of columns. The larger the number of columns,
the less accurate the statistics will be (with respect to individual
columns), assuming fixed size of the sample in ANALYZE, and fixed
statistics size.

For example, if you add a column to a multivariate histogram, you'll do
some "bucket splits" by this dimension, thus reducing the accuracy for
the other columns. You may of course allow larger statistics (e.g.
histograms with more buckets), but that also requires larger samples,
and so on.
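
A rough way to see the effect, as a sketch only: assume (this is an
assumption made for illustration, not how the patch builds histograms)
that a fixed bucket budget is spread evenly over all dimensions. Each
column then gets about the d-th root of the bucket count as its number
of intervals. The function name and the budget of 1024 are hypothetical.

```python
def intervals_per_column(total_buckets: int, n_columns: int) -> float:
    """Approximate intervals per column when a fixed bucket budget is
    split evenly across all dimensions (illustrative assumption)."""
    if total_buckets < 1 or n_columns < 1:
        raise ValueError("total_buckets and n_columns must be positive")
    return total_buckets ** (1.0 / n_columns)


# Same hypothetical budget of 1024 buckets, growing number of columns.
resolution = {d: intervals_per_column(1024, d) for d in (1, 2, 4)}

for d, r in resolution.items():
    print(f"{d} column(s): about {r:.1f} intervals per column")
```

With the budget held fixed, the per-column resolution drops quickly as
columns are added, which is the loss of accuracy described above.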


Now, let's assume you have a query like this:

    WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)

and that "a" and "b" are correlated, and "c" and "d" are correlated, but
that otherwise the columns are independent. It'd be a bit silly to
require building statistics on (a,b,c,d), when two statistics on each of
the column pairs would be cheaper and also more accurate.

That's of course a trivial case - independent groups of correlated
columns. But I'd say this is actually a pretty common case, and I do
believe there's not much controversy that we should support it.
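
To illustrate the arithmetic for this trivial case, here is a minimal
sketch on a made-up table (the data and the helper "selectivity" are
hypothetical, not taken from the patch). Column b is fully determined by
a, and d by c, while the two pairs are independent. Multiplying the two
pair selectivities recovers the true selectivity, whereas multiplying
four per-column selectivities underestimates it.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy table with columns (a, b, c, d):
# b = a + 1 and d = c + 1, while (a, b) is independent of (c, d).
rows = [(a, a + 1, c, c + 1) for a, c in product(range(1, 5), repeat=2)]


def selectivity(rows, predicate):
    """Exact fraction of rows satisfying the predicate."""
    return Fraction(sum(1 for r in rows if predicate(r)), len(rows))


# WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
actual = selectivity(rows, lambda r: r == (1, 2, 3, 4))

# Per-column statistics only: all four columns treated as independent.
per_column = (selectivity(rows, lambda r: r[0] == 1)
              * selectivity(rows, lambda r: r[1] == 2)
              * selectivity(rows, lambda r: r[2] == 3)
              * selectivity(rows, lambda r: r[3] == 4))

# Two statistics, one on (a, b) and one on (c, d), multiplied together.
pairwise = (selectivity(rows, lambda r: (r[0], r[1]) == (1, 2))
            * selectivity(rows, lambda r: (r[2], r[3]) == (3, 4)))

print("actual:    ", actual)      # 1/16
print("per-column:", per_column)  # 1/256
print("pairwise:  ", pairwise)    # 1/16
```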

Another reason to allow multiple statistics is that columns in one group
may be a good fit for an MCV list (which works well for discrete values),
while the other group may be a good candidate for a histogram (which works
well for continuous values). This can't be solved by first building an
MCV and then a histogram on the group.

The question of course is what to do if the groups are not independent.
The patch does that by assuming the statistics overlap, and uses
conditions on the columns included in both statistics to combine them
using conditional probabilities. I do believe this works quite well, but
this is perhaps the part that needs further discussion. There are other
ways to combine the statistics, but I do expect them to be considerably
more expensive.
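
One simple way to read "combine them using conditional probabilities",
shown here only as a sketch and not necessarily what the patch does:
given statistics on (a,b) and on (b,c), which overlap on b, estimate
P(a,b,c) as P(a,b) * P(c|b) = P(a,b) * P(b,c) / P(b). This is exact
when a and c are independent given b, which the made-up table below
satisfies by construction. The data and the helper "prob" are
hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy table with columns (a, b, c). Within each value of b,
# every combination of a and c occurs once, so a and c are conditionally
# independent given b, but both depend on b.
rows = ([(a, 0, c) for a, c in product((0, 1), (0, 1))]
        + [(a, 1, c) for a, c in product((1, 2), (1, 2, 3, 4))])

COLUMNS = {"a": 0, "b": 1, "c": 2}


def prob(rows, **conds):
    """Exact fraction of rows matching all column=value conditions."""
    hits = sum(1 for r in rows
               if all(r[COLUMNS[k]] == v for k, v in conds.items()))
    return Fraction(hits, len(rows))


# WHERE (a=1) AND (b=1) AND (c=2)
actual = prob(rows, a=1, b=1, c=2)

# Combine the (a,b) and (b,c) statistics through the shared column b.
combined = prob(rows, a=1, b=1) * prob(rows, b=1, c=2) / prob(rows, b=1)

# Baseline: per-column statistics with full independence assumed.
independent = prob(rows, a=1) * prob(rows, b=1) * prob(rows, c=2)

print("actual:     ", actual)       # 1/12
print("combined:   ", combined)     # 1/12
print("independent:", independent)  # 1/18
```

When a and c are not conditionally independent given b, the combined
estimate is only an approximation.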


Is this a sufficient explanation?

Of course, there's a fair amount of additional complexity that I have
not mentioned here (e.g. selecting the right combination of stats).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
