GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/11745
[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

#### What changes were proposed in this pull request?

Currently, `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effect. Although this no longer causes the optimizer to hit the max iteration limit, some queries are not optimized to the best plan. For example, in the following query,

```scala
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
    WindowExpression(
      AggregateExpression(Count('b), Complete, isDistinct = false),
      WindowSpecDefinition(
        'a :: Nil,
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window))
    .where('window > 1)
    .select('a, 'c)
```

after multiple iterations of the two rules `ColumnPruning` and `PushPredicateThroughProject`, the generated optimized plan is:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
```

However, the expected optimized plan is:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
```

#### How was this patch tested?

The existing test cases already expose the problem, but we need to add more regression tests to ensure that future code changes will not break the fix.

TODO: add more test cases.
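To make the rule interaction concrete, the ping-pong can be sketched with a toy rule set in plain Scala (this is illustrative code, not Catalyst code; all names and the simplified column-set model are assumptions): a pruning rule inserts a narrowing `Project` below the `Filter`, a predicate-pushdown rule swaps the `Filter` back below that `Project`, and a project-collapsing rule then removes it, so a full optimizer batch lands back on the original, unpruned plan.

```scala
object RuleConflictDemo {
  sealed trait Plan { def output: Set[String] }
  case class Relation(output: Set[String]) extends Plan
  case class Project(cols: Set[String], child: Plan) extends Plan {
    def output: Set[String] = cols
  }
  case class Filter(refs: Set[String], child: Plan) extends Plan {
    def output: Set[String] = child.output
  }

  // Stand-in for PushPredicateThroughProject: move a Filter below a
  // Project when the Project provides all columns the Filter references.
  def pushPredicate(p: Plan): Plan = p match {
    case Filter(refs, Project(cols, child)) if refs.subsetOf(cols) =>
      Project(cols, Filter(refs, pushPredicate(child)))
    case Project(cols, child) => Project(cols, pushPredicate(child))
    case Filter(refs, child)  => Filter(refs, pushPredicate(child))
    case r: Relation          => r
  }

  // Stand-in for ColumnPruning: insert a narrowing Project below a
  // Filter when the child produces more columns than are needed above.
  def pruneColumns(p: Plan, needed: Set[String]): Plan = p match {
    case Project(cols, child) => Project(cols, pruneColumns(child, cols))
    case Filter(refs, child) if (needed ++ refs) != child.output =>
      Filter(refs, Project(needed ++ refs, pruneColumns(child, needed ++ refs)))
    case Filter(refs, child) => Filter(refs, pruneColumns(child, needed ++ refs))
    case r: Relation         => r
  }

  // Stand-in for project collapsing: merge adjacent Projects.
  def collapse(p: Plan): Plan = p match {
    case Project(outer, Project(inner, child)) if outer.subsetOf(inner) =>
      collapse(Project(outer, child))
    case Project(cols, child) => Project(cols, collapse(child))
    case Filter(refs, child)  => Filter(refs, collapse(child))
    case r: Relation          => r
  }

  // Simplified shape of the example: select [a, c] over a filter on the
  // window column w, over a relation producing [a, b, c, d, w].
  val start: Plan =
    Project(Set("a", "c"),
      Filter(Set("w"), Relation(Set("a", "b", "c", "d", "w"))))

  // One optimizer batch: prune columns, push the predicate, collapse.
  def optimizeOnce(p: Plan): Plan =
    collapse(pushPredicate(pruneColumns(p, p.output)))

  def main(args: Array[String]): Unit = {
    // The pruning Project inserted by pruneColumns is undone by the other
    // two steps, so the batch returns the unpruned starting plan.
    println(optimizeOnce(start) == start) // prints "true"
  }
}
```

Under this toy model the fixed-point executor converges, but to the unpruned plan: each batch re-inserts and then re-removes the narrowing `Project`, which mirrors why the real optimized plan above lacks the `Project [a#0,b#0,c#0]` under the `Window`.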
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark predicatePushDownOverColumnPruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11745.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11745

----
commit c6221a4b8985ff92c425899fd48a3845ec73eb38
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-03-15T19:06:32Z

    Merge remote-tracking branch 'upstream/master' into predicatePushDownOverColumnPruning

commit c21748aa5b3c08d25d878421f1465b9ea4e20371
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-03-16T00:06:03Z

    address the conflicts of two rules: PushPredicateThroughProject and PushProjectThroughFilter.
----