[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413753#comment-15413753
 ] 

Nattavut Sutyanyong edited comment on SPARK-16951 at 8/9/16 3:55 PM:
---------------------------------------------------------------------

Thanks, @hvanhovelll, for the VLDB paper. It is a joyful reading. I take it 
that you mentioned to the last paragraph in section 6 before section 6.1 in the 
paper:


"Suppose the subquery returns the following set of values \{7, 11, NULL\} and 
T1.X has the following set of values: \{NULL, 5, 11\}, The <> ALL operation 
_(@nsyca: that, in the paper, is equivalent to NOT IN)_ can be expressed as 
T1.x <> 7 AND T1.x <> 11 AND T1.x <> NULL. This evaluates to UNKNOWN 
irrespective of the value of T1.x. Thus, for this set of values, Q23 will 
return no rows."

The point here is "what is the correct interpretation of the result from an 
empty subquery?" Take this example:

{noformat}
scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1")

scala> Seq(1).toDF("c2").createOrReplaceTempView("t2")

scala> sql("select * from t2 where 1=2").createOrReplaceTempView("empty")

scala> sql("select * from t1 where c1 in (select c2 from empty").show
+---+
| c1|
+---+
+---+
{noformat}

The result from the IN subquery statement, running from the Spark master trunk, 
returns no rows. With that, I would conclude that the NOT IN subquery should 
also return no rows.


was (Author: nsyca):
Thanks, @hvanhovelll, for the VLDB paper. It is a joyful reading. I take it 
that you mentioned to the last paragraph in section 6 before section 6.1 in the 
paper:

"Suppose the subquery returns the following set of values \x 7, 11, NULL} and 
T1.X has the following set of values: {NULL, 5, 11}, The <> ALL operation 
_(@nsyca: that, in the paper, is equivalent to NOT IN)_ can be expressed as 
T1.x <> 7 AND T1.x <> 11 AND T1.x <> NULL. This evaluates to UNKNOWN 
irrespective of the value of T1.x. Thus, for this set of values, Q23 will 
return no rows."

> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>
>                 Key: SPARK-16951
>                 URL: https://issues.apache.org/jira/browse/SPARK-16951
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite 
> to a form of Anti-join with null-aware property in the Logical Plan and then 
> translate to a form of {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} 
> predicate is limited to the nested-loop join execution plan, which will have 
> a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to