Here’s the logical plan: Project(a.*) Filter(condition: “notIn IS TRUE") GeneralizedSemiJoin(condition=‘left.a=right.b’, computes={notIn: “!= ALL”}) Scan(foo) Scan(bar)
The “notIn” column is a 3-valued boolean column computed for each record from the left input. GeneralizedSemiJoin would be able to compute columns such as “!= ALL”, “!= ANY”, “< ALL”, etc. Each returns TRUE/FALSE/UNKOWN. There may be multiple such columns. A physical implementation would be able to apply filters, e.g. "notIn IS TRUE OR notIn IS UNKNOWN”. And therefore potentially return fewer rows, or convert itself to an ordinary inner semi-join. > On Jul 24, 2020, at 10:05 PM, Haisheng Yuan <hy...@apache.org> wrote: > > I am not sure I get your idea. > What will the logical plan and physical plan look like for the following > query? > SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- bar.b is nullable > > > On 2020/07/23 01:35:44, Julian Hyde <jhyde.apa...@gmail.com> wrote: >> How about a semi-join algorithm that adds column that hold the nature of the >> match? This algorithm can be used to evaluate 3-valued IN and 3-valued NOT >> IN queries. >> >> Generalizing further, it could compute any “ANY (predicate)” or “ALL >> (predicate)” condition as a column with values TRUE/FALSE/UNKNOWN. For >> instance “empno NOT IN …” is equivalent to “ALL(empno <> …)”. >> >> Here is an example. >> >> Emp table: >> >> empno deptno mgr >> ===== ====== ==== >> 100 10 NULL >> 101 10 100 >> 102 20 101 >> 103 30 100 >> 104 30 103 >> 105 40 NULL >> >> >> The semi-join algorithm calculates the following intermediate result SJ: >> >> empno deptno mgrs match anyNulls allNulls >> ===== ====== =========== ===== ======== ======== >> 100 10 [NULL, 100] TRUE TRUE FALSE >> 101 10 [NULL, 100] FALSE TRUE FALSE >> 102 20 [101] FALSE FALSE FALSE >> 103 30 [100, 103] TRUE FALSE FALSE >> 104 30 [100, 103] FALSE FALSE TRUE >> 105 40 [NULL] FALSE TRUE TRUE >> >> So it can answer the following query: >> >> SELECT empno, >> deptno, >> empno IN (SELECT mgr FROM Emp >> WHERE deptno = e.deptno) AS “in", >> empno NOT IN (SELECT mgr FROM Emp >> WHERE deptno = e.deptno) AS “notIn" >> FROM Emp AS e; >> >> empno deptno in notIn >> ===== ====== ======= ===== >> 100 10 TRUE FALSE >> 101 10 FALSE UNKNOWN >> 102 20 FALSE TRUE >> 103 30 TRUE FALSE >> 104 40 FALSE FALSE >> 105 50 UNKNOWN UNKNOWN >> >> by mapping it to the output of the semi-join algorithm: >> >> SELECT empno, deptno, >> case when allNulls then unknown else match end AS “in”, >> case when anyNulls then unknown else not match end AS “notIn” >> FROM SJ >> >> >> >> >> >> >> >> >>> On Jul 22, 2020, at 8:57 AM, Vladimir Sitnikov >>> <sitnikov.vladi...@gmail.com> wrote: >>> >>> Julian> Vladimir said he *expected* Oracle would implement (3-valued) NOT >>> IN efficiently. (Back in the day, when I was at Oracle, they certainly did >>> not.) Does anyone have any evidence that they do? >>> >>> Well, Oracle has "null aware" joins since Oracle 11g which is more than 10 >>> years old. >>> I have not tested the actual performance of the `null-aware join`, however, >>> it would be extremely surprising, if `null-aware join` was less efficient >>> than "manually crafted aggregate + join + join + whatever". >>> >>> That is why I think it is important to be able to generate regular "not in" >>> plans. >>> >>> Vladimir >> >>