On 16.07.2018 23:55, Tomas Vondra wrote:

On 07/16/2018 02:54 PM, Dean Rasheed wrote:
On 16 July 2018 at 13:23, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
The top-level clauses allow us to make such deductions; with deeper
clauses it's much more difficult (perhaps impossible). For example, with
(a=1 AND b=1) there can be just a single match, so if we find it in the
MCV list we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2))
it's not that simple, because there may be multiple combinations, and so
a match in the MCV list does not guarantee anything.
Actually, it guarantees a lower bound on the overall selectivity, and
maybe that's the best that we can do in the absence of any other
stats.

Hmmm, is that actually true? Let's consider a simple example, with two
columns, each with just 2 values, and a "perfect" MCV list:

     a | b | frequency
    -------------------
     1 | 1 | 0.5
     2 | 2 | 0.5

And let's estimate sel(a=1 & b=2).
OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).

What we could do though is also impose an upper bound, based on the
sum of non-matching MCV frequencies. In this case, the upper bound is
also 0, so we could actually say the resulting selectivity is 0.
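
For concreteness, the two bounds could be computed like this, assuming a
hypothetical table mcv(a, b, frequency) holding the MCV items above (this
is just a sketch of the arithmetic, not planner code):

    -- hypothetical table holding the "perfect" MCV list above
    create temp table mcv (a int, b int, frequency numeric);
    insert into mcv values (1, 1, 0.5), (2, 2, 0.5);

    -- lower bound = sum of matching MCV frequencies          (here 0)
    -- upper bound = lower bound + (1 - total MCV frequency)  (here 0)
    select coalesce(sum(frequency) filter (where a = 1 and b = 2), 0) as lower_bound,
           coalesce(sum(frequency) filter (where a = 1 and b = 2), 0)
             + (1 - sum(frequency))                           as upper_bound
      from mcv;

With a "perfect" MCV list like this one, both bounds indeed collapse to 0.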

Hmmm, it's not very clear to me how we would decide which of these cases
applies, because in most cases we don't have an MCV list covering 100% of
the rows.

Anyway, I've been thinking about how to modify the code to work the way
you proposed (in a way sufficient for a PoC). But after struggling with
it for a while, it occurred to me it might be useful to do it on paper
first, to verify how it would work on your examples.

So let's use this data

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,50000);
insert into foo select 1,2 from generate_series(1,40000);
insert into foo select 1,x/10 from generate_series(30,250000) g(x);
insert into foo select 2,1 from generate_series(1,30000);
insert into foo select 2,2 from generate_series(1,20000);
insert into foo select 2,x/10 from generate_series(30,500000) g(x);
insert into foo select 3,1 from generate_series(1,10000);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,600000) g(x);
insert into foo select x,x/10 from generate_series(4,750000) g(x);
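
For reference, the 2254937.0 divisor used in the frequency queries below
is simply the total row count:

   select count(*) from foo;
   -- 2254937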

Assuming we have perfectly exact statistics, we have these MCV lists
(both univariate and multivariate):

   select a, count(*), round(count(*) /2254937.0, 4) AS frequency
     from foo group by a order by 2 desc;

      a    | count  | frequency
   --------+--------+-----------
         3 | 614998 |    0.2727
         2 | 549971 |    0.2439
         1 | 339971 |    0.1508
      1014 |      1 |    0.0000
     57220 |      1 |    0.0000
     ...

   select b, count(*), round(count(*) /2254937.0, 4) AS frequency
     from foo group by b order by 2 desc;

      b    | count | frequency
   --------+-------+-----------
         1 | 90010 |    0.0399
         2 | 65010 |    0.0288
         3 |    31 |    0.0000
         7 |    31 |    0.0000
        ...

   select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
     from foo group by a, b order by 3 desc;

      a    |   b    | count | frequency
   --------+--------+-------+-----------
         1 |      1 | 50000 |    0.0222
         1 |      2 | 40000 |    0.0177
         2 |      1 | 30000 |    0.0133
         2 |      2 | 20000 |    0.0089
         3 |      1 | 10000 |    0.0044
         3 |      2 |  5000 |    0.0022
         2 |  12445 |    10 |    0.0000
         ...

And let's estimate the two queries with complex clauses, where the
multivariate stats gave 2x overestimates.

SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
-- actual 90000, univariate: 24253, multivariate: 181091

    univariate:

      sel(a=1) = 0.1508
      sel(b=1) = 0.0399
      sel(b=2) = 0.0288
      sel(b=1 or b=2) = 0.0673

    multivariate:
      sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)

The second multivariate estimate comes from assuming only the first 5
items make it to the multivariate MCV list (covering 6.87% of the data)
and extrapolating the selectivity to the non-MCV data too.

(Notice it's about 2x the actual selectivity, so this extrapolation, done
without realizing the MCV list already contains all the matches, is pretty
much responsible for the whole over-estimate.)
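
Indeed, running the clause against the test data confirms the MCV items
already cover all the matching rows, i.e. the MCV-only selectivity is the
actual selectivity:

   select count(*) as matches,
          round(count(*) / 2254937.0, 4) as selectivity
     from foo where a = 1 and (b = 1 or b = 2);

   -- matches = 90000, selectivity = 0.0399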

So, how would the proposed algorithm work? Let's start with "a=1":

    sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (nor
do I see how that would be done, as it's effectively a single clause):

    sel(b=1 or b=2) = 0.0673

And we get

    total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

From the multivariate MCV we get

    mcv_sel = 0.0399

And finally

    total_sel = Max(total_sel, mcv_sel) = 0.0399
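
Or, spelling out that combination step as a plain SQL expression:

    select round(0.1508 * 0.0673, 4)                   as total_sel,  -- 0.0101
           greatest(round(0.1508 * 0.0673, 4), 0.0399) as final_sel;  -- 0.0399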

Which is great, but I don't see how that actually helped anything. We
still only have the estimate for the ~7% of rows covered by the MCV list,
and not for the remaining non-MCV part.

I could do the same thing for the second query, but the problem there is
actually exactly the same - extrapolation from MCV to non-MCV part
roughly doubles the estimate.

So unless I'm applying your algorithm incorrectly, this does not seem
like a very promising direction :-(

There may be valuable information we could learn from the univariate
estimates (using the Max() of them as an upper bound seems reasonable),
but that's still quite crude, and it will only ever work with simple
top-level clauses. Once the clauses get more complicated it seems
rather tricky - presumably multivariate stats would only be used for
correlated columns, so trying to deduce something from univariate
estimates of complex clauses on such columns seems somewhat suspicious.


regards



Teodor Sigaev has proposed an alternative approach for calculating the selectivity of multicolumn joins and compound index searches. DBAs usually create compound indexes, which the optimizer can use to build efficient query execution plans based on index scans. We can store statistics for the compound keys of such indexes (as is done now for functional indexes) and use them to estimate the selectivity of clauses. I have implemented this idea and will publish the patch in a separate thread soon.
For now I just want to share some results for Tomas's examples.
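
The plans below use a compound index foo_a_b_idx on (a, b), created
beforehand with something like:

    -- assumed index definition (the name foo_a_b_idx appears in the plans below)
    create index foo_a_b_idx on foo (a, b);
    analyze foo;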

With vanilla Postgres, without any extra statistics, the estimated number of rows is about 4 times smaller than the actual one (23310 vs. 90000):

 postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 or b=2);
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10964.76..10964.77 rows=1 width=8) (actual time=49.251..49.251 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=513.60..10906.48 rows=23310 width=0) (actual time=17.368..39.928 rows=90000 loops=1)
         Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
         Heap Blocks: exact=399
         ->  BitmapOr  (cost=513.60..513.60 rows=23708 width=0) (actual time=17.264..17.264 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..295.41 rows=13898 width=0) (actual time=10.319..10.319 rows=50000 loops=1)
                     Index Cond: ((a = 1) AND (b = 1))
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..206.53 rows=9810 width=0) (actual time=6.941..6.941 rows=40000 loops=1)
                     Index Cond: ((a = 1) AND (b = 2))



If we add extended statistics on the a and b columns:

     create statistics ab on a,b from foo;
     analyze foo;

then the estimate is about 30% larger than the real value (roughly 120k vs. 90k):

postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 or b=2);
                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=14447.11..14447.12 rows=1 width=8) (actual time=36.048..36.048 rows=1 loops=1)
   ->  Gather  (cost=14446.90..14447.11 rows=2 width=8) (actual time=35.982..36.037 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=13446.90..13446.91 rows=1 width=8) (actual time=30.172..30.172 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on foo  (cost=2561.33..13424.24 rows=9063 width=0) (actual time=15.551..26.057 rows=30000 loops=3)
                     Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
                     Heap Blocks: exact=112
                     ->  BitmapOr  (cost=2561.33..2561.33 rows=121360 width=0) (actual time=20.304..20.304 rows=0 loops=1)
                           ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1488.46 rows=70803 width=0) (actual time=13.190..13.190 rows=50000 loops=1)
                                 Index Cond: ((a = 1) AND (b = 1))
                           ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1061.99 rows=50556 width=0) (actual time=7.110..7.110 rows=40000 loops=1)
                                 Index Cond: ((a = 1) AND (b = 2))

With the compound index statistics, the estimate is almost equal to the real value:

---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13469.94..13469.95 rows=1 width=8) (actual time=70.710..70.710 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=1880.20..13411.66 rows=23310 width=0) (actual time=38.776..61.050 rows=90000 loops=1)
         Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
         Heap Blocks: exact=399
         ->  BitmapOr  (cost=1880.20..1880.20 rows=88769 width=0) (actual time=38.618..38.618 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1030.50 rows=49007 width=0) (actual time=26.335..26.335 rows=50000 loops=1)
                     Index Cond: ((a = 1) AND (b = 1))
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..838.05 rows=39762 width=0) (actual time=12.278..12.278 rows=40000 loops=1)
                     Index Cond: ((a = 1) AND (b = 2))

