[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413153#comment-15413153
 ] 

Herman van Hovell commented on SPARK-16951:
-------------------------------------------

Case 1 is a result of an optimizer bug. Apparently we are pushing the NOT IN 
predicate through the join. This is the plan:
{noformat}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT 'c3 IN (list#39)
   :  +- 'SubqueryAlias list#39
   :     +- 'Project ['c2]
   :        +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t3`

== Analyzed Logical Plan ==
c3: int
Project [c3#14]
+- Filter NOT predicate-subquery#39 [(c3#14 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#39 [(c3#14 = c2#10)]
   :     +- Project [c2#10]
   :        +- SubqueryAlias t2
   :           +- Project [value#8 AS c2#10]
   :              +- LocalRelation [value#8]
   +- SubqueryAlias t3
      +- Project [(c2#10 + 1) AS c3#14]
         +- Join LeftOuter, (c1#3 = c2#10)
            :- SubqueryAlias t1
            :  +- Project [value#1 AS c1#3]
            :     +- LocalRelation [value#1]
            +- SubqueryAlias t2
               +- Project [value#8 AS c2#10]
                  +- LocalRelation [value#8]

== Optimized Logical Plan ==
Project [(c2#10 + 1) AS c3#14]
+- Join LeftOuter, (c1#3 = c2#10)
   :- Project [value#1 AS c1#3]
   :  +- Join LeftAnti, (isnull(((c2#10 + 1) = c2#10)) || ((c2#10 + 1) = c2#10))
   :     :- LocalRelation [value#1]
   :     +- LocalRelation [c2#10]
   +- LocalRelation [c2#10]

== Physical Plan ==
*Project [(c2#10 + 1) AS c3#14]
+- *BroadcastHashJoin [c1#3], [c2#10], LeftOuter, BuildRight
   :- *Project [value#1 AS c1#3]
   :  +- BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull(((c2#10 + 1) = 
c2#10)) || ((c2#10 + 1) = c2#10)), true
   :     :- LocalTableScan [value#1]
   :     +- BroadcastExchange IdentityBroadcastMode
   :        +- LocalTableScan [c2#10]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)))
      +- LocalTableScan [c2#10]
{noformat}
This shouldn't be to hard to fix. I can confirm that disabling the 
{{PushPredicateThroughJoin}} rule fixes this problem.

As for case 3. I think the current behavior is correct. If there are no rows in 
the inner table, then all the rows in the outer table should be returned. This 
paper seems to agree with me: http://www.vldb.org/pvldb/2/vldb09-423.pdf 
(section 6.1)



> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>
>                 Key: SPARK-16951
>                 URL: https://issues.apache.org/jira/browse/SPARK-16951
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite 
> to a form of Anti-join with null-aware property in the Logical Plan and then 
> translate to a form of {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} 
> predicate is limited to the nested-loop join execution plan, which will have 
> a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to