[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264789#comment-15264789
 ] 

Frederick Reiss commented on SPARK-14781:
-----------------------------------------

[~davies]: I think I have a minimally-invasive plan for covering Q45.

*Existing code:* {{RewritePredicateSubquery.apply}} applies to conjunctions of 
predicates in the WHERE clause. When this rule finds an IN predicate with a 
subquery, the rule rewrites the IN predicate into a join.

*Proposed change:* Modify {{RewritePredicateSubquery}} so that it also detects 
disjunctions (ORs) where exactly one child of the disjunction is an IN 
predicate with a non-correlated subquery. Rewrite each such disjunction into a 
left outer join, followed by a Filter. The inner (right) operand of the left 
outer join should be the subquery with an additional Distinct operator above 
it. The Filter will apply the remaining predicates from the disjunction to any 
tuples that did not join with the subquery.

*Notes:*
The Distinct here is needed because the in-list could contain duplicates. The 
Distinct could be eliminated if there was a join operator that combined the 
behavior of LeftOuter and LeftSemijoin. I suppose that's what SemiPlus will do?

This approach could be extended to cover correlated IN/EXISTS subqueries. The 
rewrite would need to add unique IDs to the outer query's tuples before the 
join + filter, then remove duplicates after the join + filter. I'm *not* 
planning to do this extension in the first pass.

The approach could also be extended to cover multiple subqueries inside a 
disjunction by chaining together multiple outer joins. I'm *not* planning to do 
this extension in the first pass.

*Questions:*
* Do you foresee any problems with this approach?
* There is a second version of the IN/EXISTS subquery rewrite logic in PR 
#12720, but that code hasn't been merged yet. Would you prefer a diff against 
the current head; or a diff against the logic in PR 12720?

> Support subquery in nested predicates
> -------------------------------------
>
>                 Key: SPARK-14781
>                 URL: https://issues.apache.org/jira/browse/SPARK-14781
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to