[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265000#comment-15265000
 ] 

Frederick Reiss commented on SPARK-14781:
-----------------------------------------

Yeah, Distinct will hurt performance in the uncorrelated case if the 
subquery returns more than a few million rows. That problem won't occur in the 
particular case of TPC-DS query 45 (the subquery there returns at most 500k 
rows at a 100TB scale factor), but you never know. And of course a Distinct 
after the join, which one would need in order to cover EXISTS, could see 
billions of rows. I just figured I'd mention that possibility as an expedient 
that doesn't require any additional operators.
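
A minimal sketch of one reading of the Distinct-based expedient for the uncorrelated IN case: deduplicate the subquery result first, then do an ordinary left outer join, so that each outer row comes out exactly once. This is plain Python over lists, not Spark code; the helper name `left_outer_join_on_key`, the `sub` column, and the sample data are all hypothetical, and SQL NULL semantics for IN are ignored.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_outer_join_on_key(outer: List[Row], inner_keys: List[Any], key: str) -> List[Row]:
    """Left outer join of outer rows against a list of inner key values.

    Hypothetical helper: emits one output row per matching inner value,
    or a single row with sub=None when nothing matches.
    """
    out: List[Row] = []
    for row in outer:
        matches = [k for k in inner_keys if k == row[key]]
        if matches:
            out.extend({**row, "sub": k} for k in matches)
        else:
            out.append({**row, "sub": None})
    return out


outer = [{"id": 1, "zip": "10001"}, {"id": 2, "zip": "94105"}]
subquery = ["10001", "10001", "60601"]  # subquery result with a duplicate

# Without Distinct, the duplicate inner value duplicates outer row 1.
naive = left_outer_join_on_key(outer, subquery, "zip")

# With Distinct on the subquery, outer cardinality is preserved.
deduped = left_outer_join_on_key(outer, sorted(set(subquery)), "zip")

# The IN predicate can then be read off the joined column.
in_result = [row["sub"] is not None for row in deduped]
```

The cost concern above is about the deduplication step itself: it is cheap when applied to a small subquery result, but applying it after the join (as EXISTS would require) means deduplicating the much larger joined output.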

I'd be up for adding a "LeftSemiPlus" mode to the various join operators if 
you'd prefer that implementation start with that step. The new behavior is 
almost the same as the existing LeftSemi mode: one additional output column in 
the schema, plus code to emit rows with a null value when nothing on the inner 
side matches an outer tuple.
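
For concreteness, the LeftSemiPlus behavior described above might look roughly like the following sketch. It is plain Python over lists of dicts rather than Spark operator code; the function name `left_semi_plus`, the `match_col` parameter, and the sample tables are made up for illustration. It also assumes the extra column holds True on a match, which the description above does not specify; only the null on a non-match comes from the description.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def left_semi_plus(
    outer: List[Row],
    inner: List[Row],
    condition: Callable[[Row, Row], bool],
    match_col: str = "matched",
) -> List[Row]:
    """Emit every outer row exactly once, plus one extra column.

    The extra column is True if any inner row satisfies the condition
    (an assumption) and None (null) if nothing on the inner side matches.
    """
    result: List[Row] = []
    for o in outer:
        hit = any(condition(o, i) for i in inner)
        out = dict(o)
        out[match_col] = True if hit else None
        result.append(out)
    return result


orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "b"}, {"id": 3, "cust": "c"}]
vip = [{"cust": "a"}, {"cust": "a"}]  # duplicates do not multiply outer rows
flagged = [{"cust": "c"}]


def same_cust(o: Row, i: Row) -> bool:
    return o["cust"] == i["cust"]


# EXISTS(x1) OR EXISTS(x2): chain two joins, then filter on the two columns.
step1 = left_semi_plus(orders, vip, same_cust, "e1")
step2 = left_semi_plus(step1, flagged, same_cust, "e2")
kept = [r["id"] for r in step2 if r["e1"] is not None or r["e2"] is not None]
```

Because no outer row is dropped or duplicated, the result columns can be combined with OR (or any other boolean expression) in a later filter, which a plain LeftSemi join does not allow.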

> Support subquery in nested predicates
> -------------------------------------
>
>                 Key: SPARK-14781
>                 URL: https://issues.apache.org/jira/browse/SPARK-14781
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS(x1) OR EXISTS(x2).
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from the left, plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() with 
> the result column.



