"Kevin Grittner" <[EMAIL PROTECTED]> writes: >>> I'm adding some NOT EXISTS examples to the thread for completeness >>> of what someone might want to address while working on it. For two >>> queries which can easily be shown (to a human viewer, anyway) to >>> return identical results, I see performance differences of over >>> five orders of magnitude.
I've been looking at this a bit and have an action plan worked out. I believe that the optimizable cases for EXISTS are those where the EXISTS() is either at the top level of WHERE, or just underneath a NOT, and the contained subselect: * has no set operations (UNION etc), grouping, set-returning functions in the SELECT list, LIMIT, or a few other funny cases * references outer-level variables only in its WHERE clause. Given these conditions, an EXISTS is equivalent to a standard semijoin between the outer relations used in its WHERE clause and the sub-select with the WHERE removed, where the join condition is just the WHERE clause. (A semijoin returns one copy of each LHS row for which there exists at least one RHS row satisfying the join condition.) Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin. (An anti-semijoin returns one copy of each LHS row for which there exists no RHS row satisfying the join condition.) So what I think we should do is implement JOIN_SEMI and JOIN_ANTI as variant outer-join types in the planner and executor, and convert EXISTS into those. JOIN_SEMI would replace the existing JOIN_IN support. (It's effectively the same thing so far as the executor is concerned, but there are some places in the planner that assume an IN join condition consists of one or more btree equality operators. I'm envisioning folding the current planner support for IN into the more general outer-join planning infrastructure, and fixing it so that it doesn't assume the join clause must be of that form.) Explicit support for JOIN_ANTI is probably not strictly necessary: we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL" hack. However this only works if the join's ON-condition can be proved strict for at least one RHS variable, which is an assumption I'd rather not require for optimizing EXISTS. Also, the planner should be able to make much better estimates of the size of an antijoin result if it has an explicit concept that that's what's happening. (The existing kluge in nulltestsel() is not only ugly but has little hope of ever giving an accurate estimate.) So I'd prefer to go the other way: make the planner recognize the IS NULL trick and remove the IS NULL clause in favor of marking the LEFT JOIN as being really an antijoin. As far as IN goes: IN at the top level of WHERE can still be optimized to a semijoin under the same conditions as now. NOT IN is a lot trickier, for the same reason that typically trips up novices who try to use it: if any row of the subselect produces a NULL comparison result, then it is impossible for the NOT IN to result in TRUE, which means that it does not function as a standard antijoin. I thought about optimizing it only in the case where we can prove that the subselect outputs and the compared-to values are known NOT NULL (which in typical cases we could prove by looking for NOT NULL constraints on those table columns). The trouble with this is that that's not a sufficient condition: you must also assume that the comparison operator involved never yields NULL for non-null inputs. That might be okay for btree comparison functions but it's not a very comfy assumption in general; we certainly haven't got any explicit knowledge that any functions are guaranteed to act that way. So this case might be worth doing later but I'm not feeling excited about it. We generally tell people to avoid NOT IN and I'm happy to keep on saying that. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers