Hi David,

On 05/17/15 14:31, David Rowley wrote:
Hi Tomas,

I did glance at this patch a while back, but just thinking on it again.

I think if you find any quals that are a part of *any* foreign key
between the 2 join tables, then you should be never assume these quals
to reduce the number of rows. I believe this should be the case even if
you don't fully match all columns of the foreign key.

If we modify your example a little, let's say your foreign key between
fact and dim is made from 3 columns (a,b,c)

If we do:

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

Then we should always (under normal circumstances) find at least one
matching row, although in this case since the join qual for c is
missing, we could find more than 1 matching row.

Without digging too deep here, I'd say that the best way to do this
would be to either have calc_joinrel_size_estimate() build a list of
restrictinfo items of all quals that are NOT part of any foreign key and
pass that trimmed list down to clauselist_selectivity() for selectivity
estimates. Or perhaps a better way would be just to
teach clauselist_selectivity() about foreign keys. Likely
clauselist_selectivity() would just have to skip over RestrictInfo items
that are part of a foreign key.

I'm not particularly happy about the way the current patch tweaks the selectivity, but I think simply removing the clauses is not going to work, because that implies the (removed) conditions have selectivity 1.0 (so the estimate would match a cartesian product). So we really need to estimate the selectivity, we can't just throw them away.

And that's essentially what the current patch does - it realizes the clauses match a FK, and then computes the estimate as 1/N where N is the cardinality of the table with PK.

Another thing is that there may be multiple "candidate" foreign keys, and we can't just remove clauses matching at least one of them. Imagine e.g. a (somewhat artificial) example:

    CREATE TABLE parent (
       a INT,
       b INT,
       c INT,
       d INT,
       PRIMARY KEY (a,b),
       UNIQUE (c,d)
    );

    CREATE TABLE child (
       a INT,
       b INT,
       c INT,
       d INT,
       FOREIGN KEY (a,b) REFERENCES parent (a,b),
       FOREIGN KEY (c,d) REFERENCES parent (c,b)
    );

and a join on (a,b,c,d). We can't just discard all the conditions, because (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) matches about 1/N of the 'parent' table, but we don't know selectivity for the whole join condition.

And it may be more complex, e.g. there may be two overlapping FKeys, e.b. (a,b) and (b,c) - how do you split that?

But this may be an overkill (multiple multi-column FK join), although if we could handle that reasonably, then why not ... someone out there certainly has schema like that ;-)

What I think is somewhat more realistic is conditions matching only a subset of FK columns - for example with a FK on (a,b) the join may only use (a), for example. Then again - we can't just discard that, we need to estimate that (and use it to compute the actual selectivity).

I agree that if we want to do anything fancy with this, we will have to teach clauselist_selectivity() about foreign keys.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to