GitHub user nsyca opened a pull request:

    https://github.com/apache/spark/pull/17520

    [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries

    ## What changes were proposed in this pull request?
    This commit moves two rules right next to the rule OptimizeSubqueries:
    
    1. PullupCorrelatedPredicates: rewrites [Not] Exists and [Not] In (ListQuery) to PredicateSubquery.
    2. RewritePredicateSubquery: rewrites PredicateSubquery to LeftSemi/LeftAnti.
      
    With this change, [Not] Exists/In subqueries are rewritten to LeftSemi/LeftAnti at the beginning of the Optimizer.
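
To illustrate why an Exists predicate can be replaced by a LeftSemi join (and Not Exists by a LeftAnti join), here is a toy sketch in plain Python. It is not Spark or Catalyst code: the functions `exists_filter`, `left_semi`, `left_anti` and the sample tables are hypothetical, and join keys are assumed to be non-null, so the NULL-sensitive Not In case is deliberately out of scope.

```python
# Toy model only: tables are lists of dict rows, correlated on a single key.

def exists_filter(outer, inner, key, negate=False):
    """Filter form: evaluate the correlated subquery once per outer row."""
    kept = []
    for o in outer:
        # The "subquery": inner rows whose key equals the outer row's key.
        subquery_rows = [i for i in inner if i[key] == o[key]]
        if bool(subquery_rows) != negate:
            kept.append(o)
    return kept

def left_semi(outer, inner, key):
    """Join form: keep each outer row once if any inner row matches."""
    inner_keys = {i[key] for i in inner}
    return [o for o in outer if o[key] in inner_keys]

def left_anti(outer, inner, key):
    """Join form: keep each outer row only if no inner row matches."""
    inner_keys = {i[key] for i in inner}
    return [o for o in outer if o[key] not in inner_keys]

t1 = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"},
      {"k": 2, "v": "c"}, {"k": 3, "v": "d"}]
t2 = [{"k": 2}, {"k": 2}, {"k": 4}]

# Exists behaves like LeftSemi, Not Exists like LeftAnti.
assert exists_filter(t1, t2, "k") == left_semi(t1, t2, "k")
assert exists_filter(t1, t2, "k", negate=True) == left_anti(t1, t2, "k")
```

Note that duplicate matches in `t2` do not duplicate outer rows, which is what distinguishes a semi join from an inner join.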
        
    Because PullupCorrelatedPredicates now runs after OptimizeSubqueries, every rule invoked by the nested call to the entire Optimizer on the subquery plans must deal with (1) correlated columns wrapped in OuterReference, and (2) SubqueryExpression.
        
    We block pushdown of both kinds of expressions for the following reasons:
        
    1. We do not want to push correlated expressions further down the plan tree. Deep correlation is not yet supported in Spark and, even once supported, is harder to unnest into a join.
    2. We do not want to push a correlated subquery down, because the ExprIds of its correlated columns may need to be remapped to different ExprIds from the plan below the Filter that currently hosts the subquery.
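
The blocking described above can be pictured as a guard applied to each conjunct of a Filter condition before it is pushed down. The following is a minimal sketch in plain Python, not the actual Catalyst implementation: the `Expr` class, the string node tags, and the helpers `contains` and `split_conjuncts` are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node: a tag plus child expressions."""
    name: str
    children: Tuple["Expr", ...] = ()

# Node kinds that must not be pushed further down the plan (assumed tags).
BLOCKED = frozenset({"OuterReference", "SubqueryExpression"})

def contains(expr, names):
    """True if expr or any descendant carries one of the given tags."""
    return expr.name in names or any(contains(c, names) for c in expr.children)

def split_conjuncts(conjuncts):
    """Separate conjuncts that may be pushed down from those that must stay."""
    pushable = [c for c in conjuncts if not contains(c, BLOCKED)]
    stay = [c for c in conjuncts if contains(c, BLOCKED)]
    return pushable, stay

# a > 1: an ordinary predicate, safe to push down.
a_gt_1 = Expr("GreaterThan", (Expr("Attribute:a"), Expr("Literal:1")))
# b = outer(x): references a correlated column, so it stays put.
corr = Expr("EqualTo", (Expr("Attribute:b"),
                        Expr("OuterReference", (Expr("Attribute:x"),))))
# A predicate that embeds a subquery, so it stays put as well.
sub = Expr("SubqueryExpression")

pushable, stay = split_conjuncts([a_gt_1, corr, sub])
assert pushable == [a_gt_1]
assert stay == [corr, sub]
```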
    
    One side effect: rules PushDownPredicate and PushPredicateThroughJoin used to push down an Exists/In subquery as if it were an ordinary predicate. Now that Exists/In subqueries are rewritten to LeftSemi/LeftAnti, we need to handle the pushdown of LeftSemi/LeftAnti instead. This will be done in a follow-up commit.
        
    Another TODO is to merge the two-stage rewrite in rules PullupCorrelatedPredicates and RewritePredicateSubquery into a single-stage rewrite rule.
    
    ## How was this patch tested?
    Unit tests with test cases in SQLQueryTestSuite under the directory 
./sql/core/src/test/resources/sql-tests/inputs/subquery. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nsyca/spark 19712-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17520
    
----
commit b98865127a39bde885f9b1680cfe608629d59d51
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-07-29T21:43:56Z

    [SPARK-16804][SQL] Correlated subqueries containing LIMIT return incorrect 
results
    
    ## What changes were proposed in this pull request?
    
    This patch fixes the incorrect results in the rule ResolveSubquery in 
Catalyst's Analysis phase.
    
    ## How was this patch tested?
    ./dev/run-tests
    a new unit test on the problematic pattern.

commit 069ed8f8e5f14dca7a15701945d42fc27fe82f3c
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-07-29T21:50:02Z

    [SPARK-16804][SQL] Correlated subqueries containing LIMIT return incorrect 
results
    
    ## What changes were proposed in this pull request?
    
    This patch fixes the incorrect results in the rule ResolveSubquery in 
Catalyst's Analysis phase.
    
    ## How was this patch tested?
    ./dev/run-tests
    a new unit test on the problematic pattern.

commit edca333c081e6d4e53a91b496fba4a3ef4ee89ac
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-07-30T00:28:15Z

    New positive test cases

commit 64184fdb77c1a305bb2932e82582da28bb4c0e53
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-01T13:20:09Z

    Fix unit test case failure

commit 29f82b05c9e40e7934397257c674b260a8e8a996
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-05T17:42:01Z

    blocking TABLESAMPLE

commit ac43ab47907a1ccd6d22f920415fbb4de93d4720
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-05T21:10:19Z

    Fixing code styling

commit 631d396031e8bf627eb1f4872a4d3a17c144536c
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-07T18:39:44Z

    Correcting Scala test style

commit 7eb9b2dbba3633a1958e38e0019e3ce816300514
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-08T02:31:09Z

    One (last) attempt to correct the Scala style tests

commit 1387cf51541408ac20048064fa5e559836af932c
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2016-08-12T20:11:50Z

    Merge remote-tracking branch 'upstream/master'

commit 648afac8d35f557ca48d19b93956a9e0fbc6ea6e
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-03-14T14:12:59Z

    Merge remote-tracking branch 'upstream/master'

commit dfd476da6a9a75a36c0c01d1b6188610f213133e
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-03-16T14:16:01Z

    Merge remote-tracking branch 'upstream/master'

commit 9e1c18c9551bb5c74f7bb6c0e13a75dafe0fb859
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-03-20T14:49:38Z

    Merge remote-tracking branch 'upstream/master'

commit bc4fe9326e3c33954d223746ec36fb990fb8d994
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-03-22T23:10:17Z

    Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries
    
    This commit moves two rules right next to the rule OptimizeSubqueries.
     1. PullupCorrelatedPredicates:
        the rewrite of [Not] Exists and [Not] In (ListQuery) to PredicateSubquery
     2. RewritePredicateSubquery:
        the rewrite of PredicateSubquery to LeftSemi/LeftAnti
    
    With this change, [Not] Exists/In subqueries are now rewritten to
    LeftSemi/LeftAnti at the beginning of the Optimizer.
    
    By moving rule PullupCorrelatedPredicates after rule OptimizeSubqueries,
    all the rules from the nested call to the entire Optimizer on the plans
    in subqueries will need to deal with (1) the correlated columns wrapped
    with OuterReference, and (2) the SubqueryExpression.
    
    We will block any pushdown of both types of expressions for the
    following reasons:
    
    1. We do not want to push any correlated expressions further down the
       plan tree. Deep correlation is not yet supported in Spark and, even
       when supported, deep correlation is harder to unnest into a join.
    2. We do not want to push any correlated subquery down because the
       correlated columns' ExprIds in the subquery may need to be remapped
       to different ExprIds from the plan below the current Filter that
       hosts the subquery.
    
    One side effect is that we used to push down an Exists/In subquery as if
    it were a predicate in rule PushDownPredicate and rule
    PushPredicateThroughJoin. Now that Exists/In subqueries are rewritten to
    LeftSemi/LeftAnti, we need to handle the pushdown of LeftSemi/LeftAnti
    instead. This will be done in a follow-up commit.
    
    Another TODO is to merge the two-stage rewrite in rule
    PullupCorrelatedPredicates and rule RewritePredicateSubquery into a
    single-stage rewrite.

commit dc3aa7e3dc51f01f1f322306eccb32d17a1de26e
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-04-03T15:50:45Z

    Merge remote-tracking branch 'upstream/master'

commit 380d5d735401eb32d40fc8c3fd22d5d3f13a25de
Author: Nattavut Sutyanyong <nsy....@gmail.com>
Date:   2017-04-03T15:51:35Z

    Merge branch 'master' into phase2-1-clean

----
