[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265000#comment-15265000 ]
Frederick Reiss commented on SPARK-14781:
-----------------------------------------

Yeah, Distinct will hurt performance in the uncorrelated case if the subquery returns more than a few million rows. That problem won't occur in the particular case of TPC-DS query 45 (the subquery there returns at most 500k rows at a 100TB scale factor), but you never know. And of course a Distinct after the join, as one would need in order to cover EXISTS, would potentially see billions of rows. I just figured I'd mention that possibility as an expedient that doesn't require any additional operators.

I'd be up for adding a "LeftSemiPlus" mode to the various join operators if you'd prefer that implementation start with that step. The new behavior is almost the same as the existing LeftSemi mode: one additional output column in the schema, plus code to emit rows with a null value when nothing on the inner side matches an outer tuple.

> Support subquery in nested predicates
> -------------------------------------
>
>                 Key: SPARK-14781
>                 URL: https://issues.apache.org/jira/browse/SPARK-14781
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which
> will output every row from the left, plus an additional column holding the
> result of the join condition. Then we could replace the EXISTS() or IN() by
> the result column.
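
For readers following along, the SemiPlus idea can be sketched in a few lines of plain Python. This is not Spark code and not the proposed implementation: the function name left_semi_plus, the tuple-based rows, and the sample tables are all hypothetical, and using True for "matched" and None for "no match" is an assumption based on the comment's mention of emitting a null value. The point it tries to show is that every outer row comes out exactly once with one extra column, so EXISTS(x1) OR EXISTS(x2) can become a filter over two result columns with no Distinct afterwards.

```python
from typing import Any, Callable, Iterable, List, Tuple

Row = Tuple[Any, ...]


def left_semi_plus(
    outer: Iterable[Row],
    inner: Iterable[Row],
    condition: Callable[[Row, Row], bool],
) -> List[Row]:
    """Hypothetical SemiPlus join: emit each outer row once, plus one column.

    The extra column is True when at least one inner row satisfies the
    condition, and None otherwise (assumed encoding, see lead-in).
    """
    inner_rows = list(inner)
    result = []
    for o in outer:
        matched = any(condition(o, i) for i in inner_rows)
        result.append(o + (True if matched else None,))
    return result


# Hypothetical sample data: (id, name) on the outer side, (customer_id,) inside.
customers = [(1, "a"), (2, "b"), (3, "c")]
orders = [(1,), (1,)]   # customer 1 matches twice
returns = [(3,)]        # customer 3 matches once


def same_id(o: Row, i: Row) -> bool:
    return o[0] == i[0]


# EXISTS(orders ...) OR EXISTS(returns ...), rewritten as two SemiPlus joins
# followed by a filter on the two result columns.
step1 = left_semi_plus(customers, orders, same_id)
step2 = left_semi_plus(step1, returns, same_id)
kept = [row[:2] for row in step2 if row[2] or row[3]]

print(step2)
print(kept)
```

Customer 1 matches two rows in orders but still appears only once in the output, which is the property that makes the post-join Distinct unnecessary in this sketch.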