Re: [HACKERS] multivariate statistics / patch v7

Tomas Vondra Tue, 07 Jul 2015 12:44:21 -0700

Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:

Hi, Tomas. I'll kick the gas pedal.

Thank you, it looks clearer. I have some comment for the brief look
at this. This patchset is relatively large so I will comment on
"per-notice" basis.. which means I'll send comment before examining
the entire of this patchset. Sorry in advance for the desultory
comments.


Sure. If you run into something that's not clear enough, I'm happy to
explain that (I tried to cover all the important details in the
comments, but it's a large patch, indeed.)

- Single-variate stats have a mechanism to inject arbitrary
    values as statistics, that is, get_relation_stats_hook and the
    similar stuffs. I want the similar mechanism for multivariate
    statistics, too.


Fair point, although I'm not sure where should we place the hook,
how exactly should it be defined and how useful that would be in
the end. Can you give an example of how you'd use such hook?

...


We should carefully design the API to be able to point the pertinent
stats for every situation. Mv stats is based on the correlation of
multiple columns so I think only relid and attributes list are
enough as the parameter.

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|    /* This is the stats to be wanted  */

If we can filter the appropriate stats from all the stats using
clauselist, we definitely can make the appropriate parameter (column
set) prior to retrieving mv statistics. Isn't it correct?

Let me briefly explain how the current clauselist_selectivityimplementation works.


  (1) check if there are multivariate statistics on the table - if not,
      skip the multivariate parts altogether (the point of this is to
      minimize impact on users who don't use the new feature)

  (2) see if the are clauses compatible with multivariate stats - this
      only checks "general compatibility" without actually checking the
      existing stats (the point is to terminate early, if the clauses
      are not compatible somehow - e.g. if the clauses reference only a
      single attribute, use unsupported operators etc.)

  (3) if there are multivariate stats and compatible clauses, the
      function choose_mv_stats tries to find the best combination of
      multivariate stats with respect to the clauses (details later)

  (4) the clauses are estimated using the stats, the remaining clauses
      are estimated using the current statistics (single attribute)

The only way to reliably inject new stats is by calling a hook before(1), allowing it to arbitrarily modify the list of stats. Based on theuse cases you provided, I don't think it makes much sense to addadditional hooks in the other phases.

At this place it's however now known what clauses are compatible withmultivariate stats, or what attributes they are referencing. It might bepossible to simply call pull_varattnos() and pass it to the hook, exceptthat does not work with RestrictInfo :-/

Or maybe we could / should not put the hook into clauselist_selectivitybut somewhere else? Say, to get_relation_info where we actually read thelist of stats for the relation?

0001:

- I also don't think it is right thing for expression_tree_walker
    to recognize RestrictInfo since it is not a part of expression.


Yes. In my working git repo, I've reworked this to use the second
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:

https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?


The reason why I think it is not appropreate is that RestrictInfo
is not a part of expression.

Increasing selectivity of a condition by column correlation is
occurs only for a set of conjunctive clauses. OR operation
devides the sets. Is it agreeable? RestrictInfos can be nested
each other and we should be aware of the AND/OR operators. This
is what expression_tree_walker doesn't.

I still don't understand why you think we need to differentiate betweenAND and OR operators. There's nothing wrong with estimating OR clausesusing multivariate statistics.


Perhaps we should provide the dedicate function such like
find_conjunctive_attr_set which does this,

Perhaps. The reason why I added support for RestrictInfo into theexisting walker implementations is that it seemed like the easiest wayto fix the issue. But if there are reasons why that's incorrect, theninventing a new function is probably the right way.


- Check the type top expression of the clause

   - If it is a RestrictInfo, check clause_relids then check
     clause.

   - If it is a bool OR, stop to search and return empty set of
     attributes.

   - If it is a bool AND, make further check of the components. A
     list of RestrictInfo should be treaed as AND connection.

   - If it is operator exression, collect used relids and attrs
     walking the expression tree.

I should missing something but I think the outline is correct.

As I said before, there's nothing wrong with estimating OR clauses usingmultivariate statistics. So OR and AND should be handled exactly the same.

I think you're missing the fact that it's not enough to look at therelids from the RestrictInfo - we need to actually check what clausesare used inside, i.e. we need to check the clauses.

That's because only some of the clauses are compatible with multivariatestats, and only if all the clauses of the BoolExpr are "compatible" thenwe can estimate the clause as a whole. If it's a mix of supported andunsupported clauses, we can simply pass it to clauselist_selectivitywhich will repeat the whole process with.

Addition to that we should carefully avoid duplicate correction
using the same mv statistics.


Sure. That's what choose_mv_statistics does.


I haven't understood what choose_mv_satistics precisely but I
suppose what this function does would be split into the 'making
parameter to find stats' part and 'matching the parameter with
stats in order to retrieve desired stats' part. Could you
reconstruct this process into the form like this?

The goal of choose_mv_statistics does is very simple - given a list ofclauses, it tries to find the best combination of statistics, exploitingas much information as possible.


So let's say you have clauses

   WHERE a=1 AND b=1 AND c=1 AND d=1

but you only have statistics on [a,b], [b,c] and [b,c,d].

The simplest approach would be to use the 'largest' statistics, coveringthe most columns from the clauses - in this case [b,c,d]. This is whatthe initial patches do.

The last patch improves this significantly, by combining the statisticsusing conditional probability. In this case it'd probably use all threestatistics, effectively decomposing the selectivity like this:


  P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1)
                         [a,b]         [b,c]        [b,c,d]

And each of those probabilities can be estimated using one of the stats.

I feel it is too invasive, or exccesively intermix(?)ed.

I don't think it really fits your model - the hook has to be called muchsooner, effectively at the very beginning of the clauselist_selectivityor even before that. Otherwise it might not get called at all (e.g. ifthere are no multivariate stats on the table, this whole part will beskipped).

Why should it stop at disjunctions? There's nothing wrong with using
multivariate stats to estimate OR-clauses, IMHO.


Mv statistics represents how often *every combination of the
column values* occurs. Is it correct? Where the combination can
be replaced with coexists, that is AND. For example MV-MCV.

(a, b, c) freq
(1, 2, 3)  100
(1, 2, 5)   50
(1, 3, 8)   20
(1, 7, 2)    5
===============
total      175

| select * from t where a = 1 and b = 2 and c = 3;
| SELECT 100

This is correct,

| select * from t where a = 1 and b = 2 or c = 3;
| SELECT 100

This is *not* correct. The correct number of tuples is 150.
This is a simple example where OR breaks MV stats assumption.


No, it does not.

I'm not sure where are the numbers coming from, though. So let's see howthis actually works with multivariate statistics. I'll create a tablewith the 4 combinations you used in your example, but with 1000x morerows, to make the estimates a bit more accurate:


   CREATE TABLE  t (a INT, b INT, c INT);

   INSERT INTO t SELECT 1, 2, 3 FROM generate_series(1,100000);
   INSERT INTO t SELECT 1, 2, 5 FROM generate_series(1,50000);
   INSERT INTO t SELECT 1, 3, 8 FROM generate_series(1,20000);
   INSERT INTO t SELECT 1, 7, 2 FROM generate_series(1,5000);

   ALTER TABLE t ADD STATISTICS (mcv) ON (a,b,c);

   ANALYZE t;

And now let's see the two queries:

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=100403 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=150103 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

So the first query estimates 100k rows, the second one 150k rows.Exactly as expected, because MCV lists are discrete, match perfectly thedata and behave exactly like your mental model.

If you try this with histograms though, you'll get the same estimate inboth cases:


    ALTER TABLE t DROP STATISTICS ALL;
    ALTER TABLE t ADD STATISTICS (histogram) ON (a,b,c);
    ANALYZE t;

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

That's unfortunate, but it has nothing to do with some assumptions ofmultivariate statistics. The "problem" is that histograms are naturallyfuzzy, and both conditions hit the same bucket.


The solution is simple - don't use histograms for such discrete data.

====
   =# CREATE TABLE t1 (a int, b int, c int);
   =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0,
   9999) a);
   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
   =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
   =# ANALZYE t1;
   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
====
   Rows changed unwantedly.


That has nothing to do with OR clauses, but rather with using a
type of statistics that does not fit the data and queries.
Histograms are quite inaccurate for discrete data and equality
conditions - in this case the clauses probably match one bucket,
and so we use 1/2 the bucket as an estimate. There's nothing wrong
with that.

So let's use MCV instead:


Hmm, it's not a problem what specific number is displayed as
rows. What is crucial is the fact that rows has changed even
though it shouldn't have changed. As I demonstrated above.

Again, that has nothing to do with any assumptions, and it certainlydoes not demonstrate that OR clauses should not be handled bymultivariate statistics.


In this case, you're observing two effects.

  (1) Natural inaccuracy of histograms when used for discrete data,
      especially in combination with equality conditions (because
      that's impossible to estimate accurately with histograms).

  (2) The original estimate (without multivariate statistics) is only
      seemingly accurate, because it falsely assumes independence.
      It simply assumes that each condition matches 1/10000 of the
      table, and multiplies that, getting ~0.00001 row estimate. This
      is rounded up to 1, which is accidentally the exact value.

Let me demonstrate this on two examples - one with discrete data, onewith continuous distribution.


1) discrete data

    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t  SELECT i/1000, 2*(i/1000), 3*(i/1000)
                     FROM generate_series(1, 1000000) s(i);
    ANALYZE t;

    -- no multivariate stats (so assumption of independence)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
                   (actual time=0.290..59.120 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                   (actual time=0.434..117.643 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

    Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                   (actual time=0.433..96.956 rows=2000 loops=1)

    -- now let's add a histogram

    ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
    ANALYZE t;

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=817 width=12)
                   (actual time=0.268..116.318 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                   (actual time=0.435..93.232 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

    Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                   (actual time=0.434..122.930 rows=2000 loops=1)

    -- now let's use a MCV list

    ALTER TABLE t DROP STATISTICS ALL;
    ALTER TABLE t ADD STATISTICS (mcv) on (a,b,c);
    ANALYZE t;

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                   (actual time=0.268..70.604 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

    Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                   (actual time=0.268..70.604 rows=1000 loops=1)

    EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

    Seq Scan on t  (cost=0.00..22906.00 rows=1767 width=12)
                   (actual time=0.428..100.607 rows=2000 loops=1)

The default estimate of AND query is rather bad. For OR clause, it's notthat bad (the OR selectivity is not that bad when it comes todependency, but it's not difficult to construct counter examples).

The histogram is not that good - for the OR queries it often results inover-estimates (for equality conditions on discrete data).

But the MCV estimates are very accurate. The slight under-estimate isprobably caused by the block sampling we're using to get sample rows.



2) continuous data (I'll only show histograms)

CREATE TABLE t (a FLOAT, b FLOAT, c FLOAT);
INSERT INTO t SELECT r,
                     r + r*(random() - 0.5)/2,
                     r + r*(random() - 0.5)/2
               FROM (SELECT random() as r
                       FROM generate_series(1,1000000)) foo;
ANALYZE t;

-- no multivariate stats
EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=28768 width=24)
               (actual time=0.026..323.383 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=372362 width=24)
               (actual time=0.026..375.005 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t  (cost=0.00..23870.00 rows=192979 width=24)
                (actual time=0.026..431.376 rows=393528 loops=1)

-- histograms
ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
               (actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
               (actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
               (actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not belower than the estimate for the first query (with just AND), and itshould not increase when increasing the boundary. I'd bet this is a bugin how the inequalities are handled with histograms, or how the AND/ORclauses are combined. I'll look into that.

But once again, there's nothing that would make OR clauses somehowincompatible with multivariate stats.



kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics / patch v7

Reply via email to