[ https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505 ]
Nattavut Sutyanyong edited comment on SPARK-18597 at 11/28/16 5:09 PM: ----------------------------------------------------------------------- I also left another comment in the PR that {{ExistenceJoin}} should be treated the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. During the transformation in the rule {{PushPredicateThroughJoin}}, an ExistenceJoin never exists. The semantics of {{ExistenceJoin}} says we need to preserve all the rows from the left table through the join operation as if it is a regular {{LeftOuter}} join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new column called {{exists}}, set to true when the join condition in the ON clause is true and false otherwise. The filter of any rows will happen in the {{Filter}} operation above the {{ExistenceJoin}}. Example: A(c1, c2): { (1, 1), (1, 2) } B(c1): { (NULL) } // can be any value as it is irrelevant in this example {code:SQL} select A.* from A where exists (select 1 from B where A.c1 = A.c2) or A.c2=2 {code} In this example, the correct result is all the rows from A. If the pattern {{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be a {{Filter}} on relation A, which will filter the row (1,2) from A. If you agree with my analysis above, I can open a JIRA/a PR to remove this piece of code. was (Author: nsyca): I also left another comment in the PR that {{ExistenceJoin}} should be treated the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. During the transformation in the rule {{PushPredicateThroughJoin}}, an ExistenceJoin never exists. The semantics of {{ExistenceJoin}} says we need to preserve all the rows from the left table through the join operation as if it is a regular {{LeftOuter}} join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new column called {{exists}}, set to true when the join condition in the ON clause is true and false otherwise. The filter of any rows will happen in the {{Filter}} operation above the {ExistenceJoin}}. Example: A(c1, c2): { (1, 1), (1, 2) } B(c1): { (NULL) } // can be any value as it is irrelevant in this example {code:SQL} select A.* from A where exists (select 1 from B where A.c1 = A.c2) or A.c2=2 {code} In this example, the correct result is all the rows from A. If the pattern {{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be a {{Filter}} on relation A, which will filter the row (1,2) from A. If you agree with my analysis above, I can open a JIRA/a PR to remove this piece of code. > Do not push down filters for LEFT ANTI JOIN > ------------------------------------------- > > Key: SPARK-18597 > URL: https://issues.apache.org/jira/browse/SPARK-18597 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Herman van Hovell > Assignee: Herman van Hovell > Priority: Minor > Labels: correctness > Fix For: 2.1.0 > > > The optimizer pushes down filters for left anti joins. This unfortunately has > the opposite effect. For example: > {noformat} > sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) > as t(c1, c2)") > sql("create or replace temporary view tbl_b as values 1 as t(c1)") > sql(""" > select * > from tbl_a > left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = > tbl_a.c2) > """) > {noformat} > Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows. > The upside is that this will only happen when you use a really weird > anti-join (only referencing the table on the left hand side). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org