Hi David,
On 05/17/15 14:31, David Rowley wrote:
Hi Tomas,
I did glance at this patch a while back, but just thinking on it again.
I think if you find any quals that are a part of *any* foreign key
between the 2 join tables, then you should be never assume these quals
to reduce the number of rows. I believe this should be the case even if
you don't fully match all columns of the foreign key.
If we modify your example a little, let's say your foreign key between
fact and dim is made from 3 columns (a,b,c)
If we do:
EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);
Then we should always (under normal circumstances) find at least one
matching row, although in this case since the join qual for c is
missing, we could find more than 1 matching row.
Without digging too deep here, I'd say that the best way to do this
would be to either have calc_joinrel_size_estimate() build a list of
restrictinfo items of all quals that are NOT part of any foreign key and
pass that trimmed list down to clauselist_selectivity() for selectivity
estimates. Or perhaps a better way would be just to
teach clauselist_selectivity() about foreign keys. Likely
clauselist_selectivity() would just have to skip over RestrictInfo items
that are part of a foreign key.
I'm not particularly happy about the way the current patch tweaks the
selectivity, but I think simply removing the clauses is not going to
work, because that implies the (removed) conditions have selectivity 1.0
(so the estimate would match a cartesian product). So we really need to
estimate the selectivity, we can't just throw them away.
And that's essentially what the current patch does - it realizes the
clauses match a FK, and then computes the estimate as 1/N where N is the
cardinality of the table with PK.
Another thing is that there may be multiple "candidate" foreign keys,
and we can't just remove clauses matching at least one of them. Imagine
e.g. a (somewhat artificial) example:
CREATE TABLE parent (
a INT,
b INT,
c INT,
d INT,
PRIMARY KEY (a,b),
UNIQUE (c,d)
);
CREATE TABLE child (
a INT,
b INT,
c INT,
d INT,
FOREIGN KEY (a,b) REFERENCES parent (a,b),
FOREIGN KEY (c,d) REFERENCES parent (c,b)
);
and a join on (a,b,c,d). We can't just discard all the conditions,
because (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) matches
about 1/N of the 'parent' table, but we don't know selectivity for the
whole join condition.
And it may be more complex, e.g. there may be two overlapping FKeys,
e.b. (a,b) and (b,c) - how do you split that?
But this may be an overkill (multiple multi-column FK join), although if
we could handle that reasonably, then why not ... someone out there
certainly has schema like that ;-)
What I think is somewhat more realistic is conditions matching only a
subset of FK columns - for example with a FK on (a,b) the join may only
use (a), for example. Then again - we can't just discard that, we need
to estimate that (and use it to compute the actual selectivity).
I agree that if we want to do anything fancy with this, we will have to
teach clauselist_selectivity() about foreign keys.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers