On Tue, Aug 19, 2014 at 8:22 AM, Vladimir Sitnikov
<[email protected]> wrote:
>
> Do you have real life cases that require this not in semantics?
>
> In my experience, this behaviour is typically unexpected by developers.
> I admit we want a compliant implementation, however I wonder if it is worth
> tuning these kind of queries.
I've never come across a case where a developer wanted the bizarre
null-semantics of NOT IN. But as you say, it's the standard, so we
don't have a choice to change it.
I agree that we should not spend a great deal of effort optimizing
these queries.
There are cases where it is convenient to express a query using NOT IN
rather than say NOT EXISTS. If you are sure that your sub-query will
never return NULL, it's a reasonable thing to do. For example, all
departments with no female employees:
SELECT * FROM dept
WHERE deptno NOT IN (SELECT deptno FROM emp WHERE gender = 'F')
is more concise than
SELECT * FROM dept
WHERE NOT EXISTS (
SELECT 1 FROM emp
WHERE emp.deptno = dept.deptno
AND emp.gender = 'F')
and is "safe" because emp.deptno has a NOT NULL constraint.
>
> >(2) it should allow us to do decorrelation
> Can you please explain what is that?
Decorrelation is the process of converting a query that contains one
or more correlated sub-queries (i.e. sub-queries that reference
columns on a table in an enclosing query) into an equivalent query
that has no correlated sub-queries.
Queries without correlations are easier to optimize and are usually
more efficient at run time.
> > (3) it should allow us to represent cases that cannot be represented as
> > semi-join, e.g. ‘(x NOT IN subquery) IS UNKNOWN OR otherCondition'
> Can you please explain why this cannot be represented as a semi-join?
Semi-joins are a sequence of filters. They either remove a row, or do
not. Effectively, they implement two-value boolean logic, and can only
be combined using AND.
If "otherCondition", above, used a sub-query, say a NOT IN or NOT
EXISTS, the translator would have real trouble getting the right
semantics.
A technique that has been useful for all other kinds of sub-queries,
including correlated sub-queries, is what I call a "lookup" query. It
is a generalization of semi-join to return more than true/false.
Suppose you are translating the above NOT EXISTS query. You can
translate it to
SELECT d.*
FROM dept LEFT JOIN (
SELECT DISTINCT valgen.deptno, TRUE AS indicator
FROM (SELECT DISTINCT deptno FROM dept) AS valgen
CROSS JOIN emp
WHERE emp.deptno = valgen.deptno
AND emp.gender = 'F') AS lookup
ON dept.deptno = lookup.deptno
WHERE NOT lookup.indicator
Note that "EXISTS ..." has been converted to "lookup.indicator". This
approach can handle multiple sub-queries, 3-valued boolean logic, and
(for scalar sub-queries) other types besides boolean.
The "LEFT JOIN" combined with the "DISTINCT" ensures that joining to
the sub-query does not increase or decrease the number of rows. Any
filtering happens under the control of the WHERE clause.
I recently added [1] a rule to convert this generalized semi-join into
a regular semi-join if (a) the LEFT JOIN can be converted to INNER,
(b) the filter condition is "lookup.indicator IS TRUE", (c) the rows
from the indicator query are unique on the join columns.
I'd like to extend that rule to deal with "lookup.indicator IS NOT
TRUE", so we can have anti-semi-joins.
Julian
[1] https://issues.apache.org/jira/browse/OPTIQ-368