Since 8.4, we've been using the definition that join selectivity for a clause in a semijoin is the fraction of outer-relation tuples that pass the clause (or more strictly, for which there exists at least one inner-relation tuple with which the clause succeeds). While this looks fairly straightforward, it's become apparent to me today that this definition fails to support one of the usual expectations for selectivity, namely
sel(NOT X) = 1 - sel(X) (Please ignore questions of nulls for the moment; things are bad enough without that.) The reason is that, if sel(NOT X) is the probability that there is an inner row for which NOT X succeeds, then this is not one minus the probability that there is an inner row for which X succeeds; rather it's one minus the probability that X succeeds for *every* inner row. It was noted earlier today in pgsql-performance that this breaks neqjoinsel's implementation in terms of eqjoinsel; but it also means that clause_selectivity's handling of NOT clauses is wrong in this context, and more generally there are all sorts of gotchas in terms of trying to reason about the relationships of related selectivities. I can't avoid the feeling that this means we should be using some other definition for semijoin selectivity. I can't find any other definition in use in the literature, though. Right at the moment the consequences are pretty limited, since eqjoinsel is actually the only code that tries to compute a special selectivity estimate for semijoin/antijoin contexts. But I can foresee this becoming a real mess if we try to expand our intelligence about such cases. Any thoughts out there? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers