[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880694#comment-15880694
 ] 

Nattavut Sutyanyong commented on SPARK-16951:
---------------------------------------------

I am going to close off this JIRA at this point. This task never get going as I 
expected originally due to other tasks that yield higher benefit. NOT IN with 
null handling is an edge use case. Users usually can get away with the subtly 
different semantics of NOT EXISTS that isn't sensitive to NULL values.

I want to put my thought here just in case someone is willing to continue in 
the future.

A possibly better alternative is to implement the null-aware logic at the 
run-time join operations. Hash join and sorted-merge join should be easily 
extended to handle the presence of null values. Nested-loop join is not going 
to be easy. So the anti join converted from NOT IN will have to limit to just 
the join methods that implement the null-aware logic.

A summary of the null-aware logic is:
1. If the subquery side has a null value of the join column, the result (of the 
anti join) is an empty set.
2. If the subquery side produces an empty set (after any local predicates are 
applied), all the rows from the parent table are returned.
3. If the subquery side is not an empty set and has not a single row with null 
value on the join column, the rows with null value on the join column from the 
parent table is excluded from the result set.

> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>
>                 Key: SPARK-16951
>                 URL: https://issues.apache.org/jira/browse/SPARK-16951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite 
> to a form of Anti-join with null-aware property in the Logical Plan and then 
> translate to a form of {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} 
> predicate is limited to the nested-loop join execution plan, which will have 
> a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to