[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264789#comment-15264789 ]
Frederick Reiss commented on SPARK-14781: ----------------------------------------- [~davies]: I think I have a minimally-invasive plan for covering Q45. *Existing code:* {{RewritePredicateSubquery.apply}} applies to conjunctions of predicates in the WHERE clause. When this rule finds an IN predicate with a subquery, the rule rewrites the IN predicate into a join. *Proposed change:* Modify {{RewritePredicateSubquery}} so that it also detects disjunctions (ORs) where exactly one child of the disjunction is an IN predicate with a non-correlated subquery. Rewrite each such disjunction into a left outer join, followed by a Filter. The inner (right) operand of the left outer join should be the subquery with an additional Distinct operator above it. The Filter will apply the remaining predicates from the disjunction to any tuples that did not join with the subquery. *Notes:* The Distinct here is needed because the in-list could contain duplicates. The Distinct could be eliminated if there was a join operator that combined the behavior of LeftOuter and LeftSemijoin. I suppose that's what SemiPlus will do? This approach could be extended to cover correlated IN/EXISTS subqueries. The rewrite would need to add unique IDs to the outer query's tuples before the join + filter, then remove duplicates after the join + filter. I'm *not* planning to do this extension in the first pass. The approach could also be extended to cover multiple subqueries inside a disjunction by chaining together multiple outer joins. I'm *not* planning to do this extension in the first pass. *Questions:* * Do you foresee any problems with this approach? * There is a second version of the IN/EXISTS subquery rewrite logic in PR #12720, but that code hasn't been merged yet. Would you prefer a diff against the current head; or a diff against the logic in PR 12720? > Support subquery in nested predicates > ------------------------------------- > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org