Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14411

@nsyca We do not rewrite the subquery into a join during analysis; we rewrite subqueries into joins during optimization. We do two things during analysis:

1. We check that the subquery expression is valid. For this we need to check that the query resolves (given the outer query), and that no outer references are used in (the children of) nodes for which the join behavior is ill defined (`UNION`, for instance).
2. We rewrite IN/EXISTS/scalar subquery expressions into a `PredicateSubquery`. We do this by extracting correlated predicates and rewriting the intermediate tree. One could argue that this could also be done during optimization, but it was needed to get correlated predicates with aggregate functions (referencing the outer query) working (see the example below). For this we needed to push the complete outer condition into the Aggregate below the `HAVING` clause. Perhaps there is a simpler way of doing this, though.

```SQL
select b.key, min(b.value)
from   src b
group by b.key
having exists (select a.key
               from   src a
               where  a.value > 'val_9' and a.value = min(b.value))
```

I think we should also limit the use of `Sample`, which also filters non-deterministically and might give us very wrong results as well.
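To make the subquery-to-join rewrite concrete, here is a small, self-contained sketch using Python's `sqlite3` (not Spark, and not Spark's actual optimizer code): a correlated `EXISTS` predicate is semantically a left semi join, which is the shape the optimizer ultimately produces. The `src` table contents below are made-up sample data for illustration only.

```python
import sqlite3

# Illustrative sketch only: shows that a correlated EXISTS predicate and a
# semi-join over the same tables return the same rows. Data is made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (key TEXT, value TEXT)")
cur.executemany(
    "INSERT INTO src VALUES (?, ?)",
    [("k1", "val_1"), ("k1", "val_95"), ("k2", "val_2")],
)

# Correlated EXISTS form, as a user would write it.
exists_rows = cur.execute("""
    SELECT b.key, b.value FROM src b
    WHERE EXISTS (SELECT 1 FROM src a
                  WHERE a.value > 'val_9' AND a.key = b.key)
    ORDER BY b.key, b.value
""").fetchall()

# Equivalent semi-join form. SQLite has no LEFT SEMI JOIN keyword, so we
# emulate it with an inner join plus DISTINCT (safe here because the
# projected columns identify each outer row).
join_rows = cur.execute("""
    SELECT DISTINCT b.key, b.value FROM src b
    JOIN src a ON a.key = b.key AND a.value > 'val_9'
    ORDER BY b.key, b.value
""").fetchall()

print(exists_rows)  # [('k1', 'val_1'), ('k1', 'val_95')]
print(exists_rows == join_rows)  # True
```

The equivalence only holds when the correlated predicates can be pulled up into the join condition, which is exactly why the analysis step above has to reject subqueries whose outer references sit under nodes like `UNION` where that pull-up is not well defined.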