[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505
 ] 

Nattavut Sutyanyong edited comment on SPARK-18597 at 11/28/16 5:09 PM:
-----------------------------------------------------------------------

I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
       or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.



was (Author: nsyca):
I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
       or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.


> Do not push down filters for LEFT ANTI JOIN
> -------------------------------------------
>
>                 Key: SPARK-18597
>                 URL: https://issues.apache.org/jira/browse/SPARK-18597
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>            Priority: Minor
>              Labels: correctness
>             Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>        left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to