Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/14411
  
    @nsyca We do not rewrite the subquery into a join during analysis. We 
rewrite subqueries into joins during optimization. We do two things during 
analysis:
    
    1. We check that the subquery expression is valid. To do this we need to
verify that the subquery resolves (given the outer query), and that no outer
references are used in (the children of) nodes whose joining behavior is
ill-defined (`UNION`, for instance).
    2. We also rewrite IN/EXISTS/scalar subquery expressions into a
`PredicateSubquery`. We do this by extracting the correlated predicates and
rewriting the intermediate tree. One could argue that this could also be done
during optimization, but it was needed to get correlated predicates that use
aggregate functions (referencing the outer query) working, as in the example
below. For that we had to push the complete outer condition into the
Aggregate below the `HAVING` clause. Perhaps there is a simpler way of doing
this, though.
    ```SQL
    select b.key, min(b.value)
    from src b
    group by b.key
    having exists ( select a.key
                    from src a
                    where a.value > 'val_9' and a.value = min(b.value)
                    )
    ```
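    For comparison, here is a hedged sketch of what that rewrite roughly amounts to (illustrative only; the optimizer operates on logical plans, not SQL text): `min(b.value)` is exposed as an output of the Aggregate, after which the `EXISTS` can become a left semi join on the extracted correlated predicate:
    ```SQL
    -- Illustrative approximation of the rewritten query, not the exact plan:
    select b.key, b.min_value
    from (select key, min(value) as min_value
          from src
          group by key) b
    left semi join src a
      on a.value > 'val_9' and a.value = b.min_value
    ```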
    
    I think we should also limit the use of `Sample`, which likewise filters 
non-deterministically and might therefore give us very wrong results.
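    To illustrate (a hypothetical example, not taken from this PR): if a sampled relation were allowed under a correlated subquery, the rows surviving the sample would vary between evaluations, so the `EXISTS` test could succeed or fail non-deterministically for the same outer row:
    ```SQL
    -- Hypothetical: the sampled rows differ per evaluation, so the result
    -- of the EXISTS check is unstable across runs.
    select b.key
    from src b
    where exists ( select 1
                   from src tablesample (50 percent) a
                   where a.key = b.key )
    ```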

