Re: [HACKERS] multivariate statistics / patch v7

Tomas Vondra Sat, 04 Jul 2015 12:07:58 -0700

Hello Horiguchi-san!

On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote:

Hello, I started to work on this patch.

attached is v7 of the multivariate stats patch. The main improvement
is major refactoring of the clausesel.c portion - splitting the
awfully long spaghetti-style functions into smaller pieces, making it
much more understandable etc.


Thank you, it looks clearer. I have some comment for the brief look
at this. This patchset is relatively large so I will comment on
"per-notice" basis.. which means I'll send comment before examining
the entire of this patchset. Sorry in advance for the desultory
comments.

Sure. If you run into something that's not clear enough, I'm happy toexplain that (I tried to cover all the important details in thecomments, but it's a large patch, indeed.)

=======
General comments:

- You included unnecessary stuffs such like regression.diffs in
   these patches.


Ahhhh :-/ Will fix.


- Now OID 3307 is used by pg_stat_file. I moved
   pg_mv_stats_dependencies_info/show to 3311/3312.


Will fix while rebasing to current master.


- Single-variate stats have a mechanism to inject arbitrary
   values as statistics, that is, get_relation_stats_hook and the
   similar stuffs. I want the similar mechanism for multivariate
   statistics, too.

Fair point, although I'm not sure where should we place the hook, howexactly should it be defined and how useful that would be in the end.Can you give an example of how you'd use such hook?

I've never used get_relation_stats_hook, but if I get it right, theplugins can use the hook to create the stats (for each column), eitherfrom scratch or tweaking the existing stats.

I'm not sure how this should work with multivariate stats, though,because there can be arbitrary number of stats for a column, and itreally depends on all the clauses (so examine_variable() seems a bitinappropriate, as it only sees a single variable at a time).


Moreover, with multivariate stats

   (a) there may be arbitrary number of stats for a column

   (b) only some of the stats end up being used for the estimation

I see two or three possible places for calling such hook:

   (a) at the very beginning, after fetching the list of stats

       - sees all the existing stats on a table
       - may add entirely new stats or tweak the existing ones

   (b) after collecting the list of variables compatible with
       multivariate stats

       - like (a) and additionally knows which columns are interesting
         for the query (but only with respect to the existing stats)

   (c) after optimization (selection of the right combination if stats)

       - like (b), but can't affect the optimization

But I can't really imagine anyone building multivariate stats on thefly, in the hook.

It's more complicated, though, because the query may callclauselist_selectivity multiple times, depending on how complex theWHERE clauses are.

0001:

- I also don't think it is right thing for expression_tree_walker
   to recognize RestrictInfo since it is not a part of expression.

Yes. In my working git repo, I've reworked this to use the secondoption, i.e. adding RestrictInfo pull_(varno|varattno)_walker:


https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?


0003:

- In clauselist_selectivity, find_stats is uselessly called for
   single clause. This should be called after the clauselist found
   to consist more than one clause.


Ok, will fix.


- Searching vars to be compared with mv-stat columns which
   find_stats does should stop at disjunctions. But this patch
   doesn't behave so and it should be an unwanted behavior. The
   following steps shows that.

Why should it stop at disjunctions? There's nothing wrong with usingmultivariate stats to estimate OR-clauses, IMHO.


====
  =# CREATE TABLE t1 (a int, b int, c int);
  =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
  =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
  =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
  =# ANALZYE t1;
  =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
====
  Rows changed unwantedly.

That has nothing to do with OR clauses, but rather with using a type ofstatistics that does not fit the data and queries. Histograms are quiteinaccurate for discrete data and equality conditions - in this case theclauses probably match one bucket, and so we use 1/2 the bucket as anestimate. There's nothing wrong with that.


So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

  It seems not so simple thing as your code assumes.

Maybe, but I don't see what assumption is invalid? I see nothing wrongwith the previous query.


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics / patch v7

Reply via email to