Re: [HACKERS] Multi-column distinctness.

Tomas Vondra Sun, 06 Sep 2015 01:32:55 -0700

Hello Horiguchi-san,

On 08/28/2015 10:33 AM, Kyotaro HORIGUCHI wrote:

Hello, this patch enables the planner to be conscious of inter-column
correlation.


Sometimes two or more columns in a table have some correlation,
which causes an underestimate, which leads to the wrong join method
and ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between the
row count estimated by the current method and the actual row count.
I think this doesn't conflict with his work except the grammar part.

I'd like to sign up as a reviewer of this patch, if you're OK with that.
I don't see the patch in any of the commitfests, though :-(

Now, a few comments based on reading the patches (will try testing them
in the next few days, hopefully).


1) catalog design
-----------------

I see you've invented a catalog with a slightly different structure than
I use in the multivariate stats patch, storing the coefficient for each
combination of columns in a separate row. The idea in the multivariate
patch is to generate all possible combinations of columns and store the
coefficients packed in a single row, which requires implementing a more
complicated data structure. But your patch does not do that (only
computes coefficient for the one combination).

This is another potential incompatibility with the multivariate patch,
although it's rather in the background and should not be difficult to
rework if needed.


2) find_mv_coeffeicient
-----------------------

I don't quite see why this is the right thing:

    /* Prefer smaller one */
    if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)
        mv_coef = mvc->mvccoefficient;

i.e. why it's correct to choose the record with the lower coefficient
and not the one with most columns (the largest subset). Maybe there's
some rule that those are in fact the same thing (seems like adding a
column can only lower the coefficient), but it's not quite obvious and
would deserve a comment.
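
To make the question concrete, here is a minimal sketch of the two
selection rules side by side. Everything in it is hypothetical and not
taken from the patch: the MvCoefEntry struct (a column bitmask plus a
coefficient), both function names, and the assumption that only entries
whose columns all appear in the clauses are considered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one catalog row (not the patch's struct). */
typedef struct MvCoefEntry
{
    uint32_t cols;  /* bit i set => column i belongs to the combination */
    double   coef;  /* stored coefficient; <= 0 means "not computed" */
} MvCoefEntry;

/* Number of columns in a combination. */
static int
count_cols(uint32_t cols)
{
    int n = 0;

    while (cols != 0)
    {
        cols &= cols - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* An entry is usable if it is computed and all its columns are in the clauses. */
static int
entry_applies(const MvCoefEntry *e, uint32_t clause_cols)
{
    return e->coef > 0 && e->cols != 0 && (e->cols & ~clause_cols) == 0;
}

/* "Prefer smaller one": index of the usable entry with the lowest coefficient, or -1. */
int
pick_smallest_coef(const MvCoefEntry *e, size_t n, uint32_t clause_cols)
{
    int best = -1;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!entry_applies(&e[i], clause_cols))
            continue;
        if (best < 0 || e[i].coef < e[best].coef)
            best = (int) i;
    }
    return best;
}

/* Alternative: index of the usable entry covering the most columns, or -1. */
int
pick_most_columns(const MvCoefEntry *e, size_t n, uint32_t clause_cols)
{
    int best = -1;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!entry_applies(&e[i], clause_cols))
            continue;
        if (best < 0 || count_cols(e[i].cols) > count_cols(e[best].cols))
            best = (int) i;
    }
    return best;
}
```

If adding a column really can only lower the coefficient, both functions
would yield the same coefficient for nested column sets. They can
disagree only when a larger combination has the higher coefficient,
which is exactly the case a comment in the patch should rule out.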


3) single statistics limitation
-------------------------------

What annoys me a bit is that this patch only applies a single
statistics - it won't try "combining" multiple smaller ones (even if
they don't overlap, which shouldn't be that difficult).
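
A rough sketch of what combining non-overlapping statistics could look
like, using a greedy pass over column bitmasks. The ColStat struct and
combine_stats() are hypothetical names, and multiplying the chosen
coefficients assumes the disjoint column groups are independent of each
other, which is a simplification and not something the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one statistics entry (not the patch's struct). */
typedef struct ColStat
{
    uint32_t cols;  /* bit i set => column i belongs to the combination */
    double   coef;  /* stored coefficient; <= 0 means "not computed" */
} ColStat;

/* Number of set bits in x. */
static int
popcount32(uint32_t x)
{
    int n = 0;

    while (x != 0)
    {
        x &= x - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Greedily pick statistics that fit inside clause_cols and do not overlap
 * the ones already picked, preferring the one with the most columns.
 * Returns the product of the picked coefficients (1.0 if none applies).
 * ASSUMPTION: disjoint column groups are treated as independent.
 * If covered is not NULL, it receives the union of the picked columns.
 */
double
combine_stats(const ColStat *stats, size_t n, uint32_t clause_cols,
              uint32_t *covered)
{
    uint32_t used = 0;
    double   result = 1.0;

    assert(stats != NULL || n == 0);
    for (;;)
    {
        int best = -1;

        for (size_t i = 0; i < n; i++)
        {
            uint32_t c = stats[i].cols;

            if (c == 0 || stats[i].coef <= 0)
                continue;               /* empty or not computed */
            if ((c & ~clause_cols) != 0)
                continue;               /* uses a column outside the clauses */
            if ((c & used) != 0)
                continue;               /* overlaps an already picked entry */
            if (best < 0 || popcount32(c) > popcount32(stats[best].cols))
                best = (int) i;
        }
        if (best < 0)
            break;                      /* nothing more fits */
        used |= stats[best].cols;       /* each pick adds at least one column */
        result *= stats[best].coef;
    }
    if (covered != NULL)
        *covered = used;
    return result;
}
```

With statistics on [a,b] and [c,d] and clauses on all four columns, this
picks both entries. Columns left uncovered would still be estimated the
current way.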


4) no sub-combinations
----------------------

Another slightly annoying thing is that the patch computes statistics
only for the specified combination of columns, and no subsets. So if you
have enabled stats on [a,b,c] you won't be able to use that for
conditions on [a,b], for example. Sure, the user can create the stats
manually, but IMO the less burden we place on the user the better. It's
enough that we force users to create the stats at all, we shouldn't
really force them to create all possible combinations.
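
For illustration, generating the sub-combinations is cheap when the
columns are kept as a bitmask. The sketch below is not from either
patch; enumerate_combinations() is a hypothetical helper that walks all
submasks of a column set and keeps those with at least two columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in x. */
static int
count_bits(uint32_t x)
{
    int n = 0;

    while (x != 0)
    {
        x &= x - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Enumerate every subset of cols that has at least two columns, from the
 * full set downwards. At most out_cap masks are written to out (which may
 * be NULL when out_cap is 0); the return value is the total number of
 * such subsets, so a caller can size the array with a first call.
 */
size_t
enumerate_combinations(uint32_t cols, uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    assert(out != NULL || out_cap == 0);
    /* (s - 1) & cols steps to the next smaller submask of cols */
    for (uint32_t s = cols; s != 0; s = (s - 1) & cols)
    {
        if (count_bits(s) < 2)
            continue;   /* single columns already have per-column stats */
        if (n < out_cap)
            out[n] = s;
        n++;
    }
    return n;
}
```

For stats on [a,b,c] (mask 0x7) this yields four combinations: [a,b,c],
[b,c], [a,c] and [a,b]. In general there are 2^n - n - 1 of them for n
columns, so the count grows quickly with the number of columns.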


5) syntax
---------

The syntax might be one of the pain points if we eventually decide to
commit the multivariate stats patch. I have no intention of blocking
this patch for that reason, but if we could design the syntax to make
it compatible with the multivariate patch, that'd be nice. But it seems
to me the syntax is pretty much the same, no?


I.e. it uses

    ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the
stattypes in ATExecAddDropMvStatistics except for checking there's a
single entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit
confusing. Maybe two separate methods would be better?


6) GROUP BY
-----------

One thing I'd really love to see is improvement of GROUP BY clauses.
That's one of the main sources of pain in analytical queries, IMHO,
resulting in either OOM issues in Hash Aggregate, or choice of sort
paths where hash would be more efficient.


regards
Tomas

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

