GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/11745
[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

#### What changes were proposed in this pull request?

Currently, `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effect. Although this no longer causes the optimizer to hit the max iteration limit, some queries are not optimized to the best plan. For example, in the following query,

```scala
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
    WindowExpression(
      AggregateExpression(Count('b), Complete, isDistinct = false),
      WindowSpecDefinition(
        'a :: Nil,
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window))
    .where('window > 1)
    .select('a, 'c)
```

after multiple iterations of the two rules `ColumnPruning` and `PushPredicateThroughProject`, the generated optimized plan is:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
```

However, the expected optimized plan is:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
```

#### How was this patch tested?

The existing test cases already expose the problem, but we need to add more regression tests to ensure that future code changes will not break the fix.

TODO: add more test cases.
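To make the rule interaction concrete, the ping-pong can be sketched with a toy rule set in plain Scala (this is illustrative code, not Catalyst code; all names and the simplified column-set model are assumptions): a pruning rule inserts a narrowing `Project` below the `Filter`, a predicate-pushdown rule swaps the `Filter` back below that `Project`, and a project-collapsing rule then removes it, so a full optimizer batch lands back on the original, unpruned plan.

```scala
object RuleConflictDemo {
  sealed trait Plan { def output: Set[String] }
  case class Relation(output: Set[String]) extends Plan
  case class Project(cols: Set[String], child: Plan) extends Plan {
    def output: Set[String] = cols
  }
  case class Filter(refs: Set[String], child: Plan) extends Plan {
    def output: Set[String] = child.output
  }

  // Stand-in for PushPredicateThroughProject: move a Filter below a
  // Project when the Project provides all columns the Filter references.
  def pushPredicate(p: Plan): Plan = p match {
    case Filter(refs, Project(cols, child)) if refs.subsetOf(cols) =>
      Project(cols, Filter(refs, pushPredicate(child)))
    case Project(cols, child) => Project(cols, pushPredicate(child))
    case Filter(refs, child)  => Filter(refs, pushPredicate(child))
    case r: Relation          => r
  }

  // Stand-in for ColumnPruning: insert a narrowing Project below a
  // Filter when the child produces more columns than are needed above.
  def pruneColumns(p: Plan, needed: Set[String]): Plan = p match {
    case Project(cols, child) => Project(cols, pruneColumns(child, cols))
    case Filter(refs, child) if (needed ++ refs) != child.output =>
      Filter(refs, Project(needed ++ refs, pruneColumns(child, needed ++ refs)))
    case Filter(refs, child) => Filter(refs, pruneColumns(child, needed ++ refs))
    case r: Relation         => r
  }

  // Stand-in for project collapsing: merge adjacent Projects.
  def collapse(p: Plan): Plan = p match {
    case Project(outer, Project(inner, child)) if outer.subsetOf(inner) =>
      collapse(Project(outer, child))
    case Project(cols, child) => Project(cols, collapse(child))
    case Filter(refs, child)  => Filter(refs, collapse(child))
    case r: Relation          => r
  }

  // Simplified shape of the example: select [a, c] over a filter on the
  // window column w, over a relation producing [a, b, c, d, w].
  val start: Plan =
    Project(Set("a", "c"),
      Filter(Set("w"), Relation(Set("a", "b", "c", "d", "w"))))

  // One optimizer batch: prune columns, push the predicate, collapse.
  def optimizeOnce(p: Plan): Plan =
    collapse(pushPredicate(pruneColumns(p, p.output)))

  def main(args: Array[String]): Unit = {
    // The pruning Project inserted by pruneColumns is undone by the other
    // two steps, so the batch returns the unpruned starting plan.
    println(optimizeOnce(start) == start) // prints "true"
  }
}
```

Under this toy model the fixed-point executor converges, but to the unpruned plan: each batch re-inserts and then re-removes the narrowing `Project`, which mirrors why the real optimized plan above lacks the `Project [a#0,b#0,c#0]` under the `Window`.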
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark predicatePushDownOverColumnPruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11745.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11745

----
commit c6221a4b8985ff92c425899fd48a3845ec73eb38
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-03-15T19:06:32Z

    Merge remote-tracking branch 'upstream/master' into predicatePushDownOverColumnPruning

commit c21748aa5b3c08d25d878421f1465b9ea4e20371
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-03-16T00:06:03Z

    address the conflicts of two rules: PushPredicateThroughProject and PushProjectThroughFilter.
----