Here’s the logical plan:

Project(a.*)
  Filter(condition: “notIn IS TRUE")
    GeneralizedSemiJoin(condition=‘left.a=right.b’, computes={notIn: “!= ALL”})
      Scan(foo)
      Scan(bar)

The “notIn” column is a 3-valued boolean column computed for each record from 
the left input.

GeneralizedSemiJoin would be able to compute columns such as “!= ALL”, “!= 
ANY”, “< ALL”, etc. Each returns TRUE/FALSE/UNKOWN.

There may be multiple such columns.

A physical implementation would be able to apply filters, e.g. "notIn IS TRUE 
OR notIn IS UNKNOWN”. And therefore potentially return fewer rows, or convert 
itself to an ordinary inner semi-join.


> On Jul 24, 2020, at 10:05 PM, Haisheng Yuan <hy...@apache.org> wrote:
> 
> I am not sure I get your idea.
> What will the logical plan and physical plan look like for the following 
> query?
> SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- bar.b is nullable
> 
> 
> On 2020/07/23 01:35:44, Julian Hyde <jhyde.apa...@gmail.com> wrote: 
>> How about a semi-join algorithm that adds column that hold the nature of the 
>> match? This algorithm can be used to evaluate 3-valued IN and 3-valued NOT 
>> IN queries.
>> 
>> Generalizing further, it could compute any “ANY (predicate)” or “ALL 
>> (predicate)” condition as a column with values TRUE/FALSE/UNKNOWN. For 
>> instance “empno NOT IN …” is equivalent to “ALL(empno <> …)”.
>> 
>> Here is an example.
>> 
>> Emp table:
>> 
>> empno deptno  mgr
>> ===== ====== ====
>>  100     10 NULL
>>  101     10  100
>>  102     20  101
>>  103     30  100
>>  104     30  103
>>  105     40 NULL
>> 
>> 
>> The semi-join algorithm calculates the following intermediate result SJ:
>> 
>> empno deptno mgrs        match anyNulls allNulls
>> ===== ====== =========== ===== ======== ========
>>  100     10 [NULL, 100] TRUE  TRUE     FALSE
>>  101     10 [NULL, 100] FALSE TRUE     FALSE
>>  102     20 [101]       FALSE FALSE    FALSE
>>  103     30 [100, 103]  TRUE  FALSE    FALSE
>>  104     30 [100, 103]  FALSE FALSE    TRUE
>>  105     40 [NULL]      FALSE TRUE     TRUE
>> 
>> So it can answer the following query:
>> 
>> SELECT empno,
>>  deptno,
>>  empno IN (SELECT mgr FROM Emp
>>            WHERE deptno = e.deptno) AS “in",
>>  empno NOT IN (SELECT mgr FROM Emp
>>            WHERE deptno = e.deptno) AS “notIn"
>> FROM Emp AS e;
>> 
>> empno deptno in      notIn
>> ===== ====== ======= =====
>>  100     10 TRUE    FALSE
>>  101     10 FALSE   UNKNOWN
>>  102     20 FALSE   TRUE
>>  103     30 TRUE    FALSE
>>  104     40 FALSE   FALSE
>>  105     50 UNKNOWN UNKNOWN
>> 
>> by mapping it to the output of the semi-join algorithm:
>> 
>> SELECT empno, deptno,
>>  case when allNulls then unknown else match end AS “in”,
>>  case when anyNulls then unknown else not match end AS “notIn”
>> FROM SJ
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Jul 22, 2020, at 8:57 AM, Vladimir Sitnikov 
>>> <sitnikov.vladi...@gmail.com> wrote:
>>> 
>>> Julian> Vladimir said he *expected* Oracle would implement (3-valued) NOT
>>> IN efficiently. (Back in the day, when I was at Oracle, they certainly did
>>> not.) Does anyone have any evidence that they do?
>>> 
>>> Well, Oracle has "null aware" joins since Oracle 11g which is more than 10
>>> years old.
>>> I have not tested the actual performance of the `null-aware join`, however,
>>> it would be extremely surprising, if `null-aware join` was less efficient
>>> than "manually crafted aggregate + join + join + whatever".
>>> 
>>> That is why I think it is important to be able to generate regular "not in"
>>> plans.
>>> 
>>> Vladimir
>> 
>> 

Reply via email to