[HACKERS] Semijoin selectivity estimation

Tom Lane Fri, 21 Jan 2011 16:04:41 -0800

Since 8.4, we've been using the definition that join selectivity for
a clause in a semijoin is the fraction of outer-relation tuples that
pass the clause (or more strictly, for which there exists at least
one inner-relation tuple with which the clause succeeds).  While this
looks fairly straightforward, it's become apparent to me today that
this definition fails to support one of the usual expectations for
selectivity, namely


        sel(NOT X) = 1 - sel(X)

(Please ignore questions of nulls for the moment; things are bad enough
without that.)  The reason is that, if sel(NOT X) is the probability
that there is an inner row for which NOT X succeeds, then this is not
one minus the probability that there is an inner row for which X
succeeds; rather it's one minus the probability that X succeeds for
*every* inner row.

It was noted earlier today in pgsql-performance that this breaks
neqjoinsel's implementation in terms of eqjoinsel; but it also means
that clause_selectivity's handling of NOT clauses is wrong in this
context, and more generally there are all sorts of gotchas in terms of
trying to reason about the relationships of related selectivities.

I can't avoid the feeling that this means we should be using some other
definition for semijoin selectivity.  I can't find any other definition
in use in the literature, though.

Right at the moment the consequences are pretty limited, since eqjoinsel
is actually the only code that tries to compute a special selectivity
estimate for semijoin/antijoin contexts.  But I can foresee this
becoming a real mess if we try to expand our intelligence about such
cases.

Any thoughts out there?

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Semijoin selectivity estimation

Reply via email to