[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11745 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197752404 **[Test build #53400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53400/consoleFull)** for PR 11745 at commit [`b853e01`](https://github.com/apache/spark/commit/b853e01a281c575ccbf5e312840b749fd479fe69).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197730305 **[Test build #53400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53400/consoleFull)** for PR 11745 at commit [`b853e01`](https://github.com/apache/spark/commit/b853e01a281c575ccbf5e312840b749fd479fe69).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197752855 Merged build finished. Test PASSed.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56532685

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *    - Aggregate
- *    - Generate
- *    - Project <- Join
- *    - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
```

End diff

Yeah, I have the same concern. This PR just resolves the conflicts on top of the current infrastructure. In my opinion, each batch needs a few rule sets: the order of the rule sets does not matter, but within each rule set the order of rules does. However, that is a fundamental design change. @marmbrus @rxin might have a better idea on this.
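The "rule sets inside a batch" idea gatorsmile sketches above can be modeled with a toy executor. This is a hypothetical illustration, not Spark's actual `RuleExecutor` API; `Plan`, `applySet`, `runBatch`, and `dedup` are all invented names. Rules inside a set run in a fixed left-to-right order, while the sets themselves are simply iterated until the plan stops changing:

```scala
// Hypothetical model of "ordered rules inside unordered rule sets" (not
// Spark's RuleExecutor). A plan is caricatured as a list of operator names.
case class Plan(ops: List[String])

// Apply one ordered rule set to a plan, left to right.
def applySet(rules: Seq[Plan => Plan])(p: Plan): Plan =
  rules.foldLeft(p)((acc, r) => r(acc))

// Run every rule set repeatedly until nothing changes (with a safety cap,
// as a real executor would have).
def runBatch(sets: Seq[Seq[Plan => Plan]], p: Plan, maxIter: Int = 100): Plan = {
  var cur = p
  var changed = true
  var i = 0
  while (changed && i < maxIter) {
    val next = sets.foldLeft(cur)((acc, s) => applySet(s)(acc))
    changed = next != cur
    cur = next
    i += 1
  }
  cur
}

// Toy rule standing in for a real rewrite: drop duplicate operators.
val dedup: Plan => Plan = p => Plan(p.ops.distinct)

val optimized = runBatch(Seq(Seq(dedup)), Plan(List("Project", "Project", "Filter")))
```

Because each set is internally ordered but the sets are iterated to a fixed point, the outcome does not depend on which set happens to run first, which is the property the comment argues for.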
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197752864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53400/
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56463073

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *    - Aggregate
- *    - Generate
- *    - Project <- Join
- *    - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
```

End diff

I'm a little wary of depending too much on rule order. Sometimes we have to, as other solutions are way too complex, but for this issue, can we try to find a more general solution?
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56705349

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *    - Aggregate
- *    - Generate
- *    - Project <- Join
- *    - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
+ * pushed as low as possible.
+ */
+object PushProjectThroughFilter extends Rule[LogicalPlan] {
```

End diff

We do not actually PUSH the Project through the Filter; we create a new Project underneath it to prune some columns. As I said in another PR, we remove those Projects before the Filter.
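The distinction davies draws, that column pruning inserts a new, narrower Project beneath the Filter rather than moving the existing one, can be illustrated with a toy plan model. The types (`Node`, `Relation`, `Filter`, `Project`) and the `prune` function are invented for this sketch, and the single-column `cond: String` is a simplifying assumption; real predicates reference many attributes:

```scala
// Toy plan model, not Catalyst's. `cond` names the one column the filter reads.
sealed trait Node { def output: Set[String] }
case class Relation(cols: Set[String]) extends Node { def output: Set[String] = cols }
case class Filter(cond: String, child: Node) extends Node { def output: Set[String] = child.output }
case class Project(cols: Set[String], child: Node) extends Node { def output: Set[String] = cols }

// Insert a *new* pruning Project below the Filter when the child produces
// columns that neither the Project above nor the Filter condition needs.
// The existing Project above the Filter is left where it is.
def prune(n: Node): Node = n match {
  case Project(cols, Filter(cond, child)) =>
    val needed = cols + cond
    if (needed != child.output) Project(cols, Filter(cond, Project(needed, child)))
    else n
  case other => other
}

val plan = Project(Set("a", "c"), Filter("b", Relation(Set("a", "b", "c", "d"))))
val pruned = prune(plan)
// "d" is no longer read; "b" survives only because the Filter needs it.
```

Note that `prune` is idempotent here: once the narrow Project is in place, the child's output matches exactly what is needed, so a second application changes nothing.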
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56744011

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
```

End diff

oh, I missed that, sorry.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56743924

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
```

End diff

I did not remove it from the original batch; I just added the extra batch here.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56743896

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
```

End diff

Putting this rule in a separate batch is not correct; some other filter push-down rules depend on this one.
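The hazard davies describes, that a rule isolated in its own run-once batch can never re-fire on shapes produced by later rules, can be sketched with flattened toy plans. All names here are invented (not Catalyst rules); each rule performs at most one rewrite per call, mimicking a single-pass batch:

```scala
// Toy rules over a top-down operator list; one rewrite per pass.

// Swap the first Filter that sits directly above a Project.
def pushFilter(ops: List[String]): List[String] = ops match {
  case "Filter" :: "Project" :: rest => "Project" :: "Filter" :: rest
  case head :: tail => head :: pushFilter(tail)
  case Nil => Nil
}

// Collapse adjacent Project operators into one.
def combineProjects(ops: List[String]): List[String] = ops match {
  case "Project" :: "Project" :: rest => combineProjects("Project" :: rest)
  case head :: tail => head :: combineProjects(tail)
  case Nil => Nil
}

// Apply all rules round-robin until the plan stops changing.
def fixpoint(rules: Seq[List[String] => List[String]],
             ops: List[String]): List[String] = {
  var cur = ops
  var changed = true
  while (changed) {
    val next = rules.foldLeft(cur)((acc, r) => r(acc))
    changed = next != cur
    cur = next
  }
  cur
}

val plan = List("Filter", "Project", "Project", "Scan")

// pushFilter in its own run-once batch: it fires a single time and never
// again, so the Filter gets stuck between the two Projects.
val separated = combineProjects(pushFilter(plan))

// The same rules iterated together to a fixed point: the Filter reaches the scan.
val together = fixpoint(Seq(pushFilter, combineProjects), plan)
```

Only the combined fixed-point run pushes the Filter all the way down, which is why interdependent rules generally need to live in the same iterated batch.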
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11745#discussion_r56743788

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *    - Aggregate
- *    - Generate
- *    - Project <- Join
- *    - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
+ * pushed as low as possible.
+ */
+object PushProjectThroughFilter extends Rule[LogicalPlan] {
```

End diff

@davies The name of this rule is not right, but I still think this PR fixes the fundamental conflict between `ColumnPruning` and `PushPredicateThroughProject`. If we do not take the approach of this PR, I can find a test case showing that the minor fix in `ColumnPruning` does not cover all the cases.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197112611 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197112616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53248/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197112418 **[Test build #53248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53248/consoleFull)** for PR 11745 at commit [`c21748a`](https://github.com/apache/spark/commit/c21748aa5b3c08d25d878421f1465b9ea4e20371).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11745#issuecomment-197082803 **[Test build #53248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53248/consoleFull)** for PR 11745 at commit [`c21748a`](https://github.com/apache/spark/commit/c21748aa5b3c08d25d878421f1465b9ea4e20371).
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/11745

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject

What changes were proposed in this pull request?

Currently, `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effect. Although this does not trigger the max-iteration limit now, some queries are not optimized to the best plan. For example, in the following query,

```scala
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
    WindowExpression(
      AggregateExpression(Count('b), Complete, isDistinct = false),
      WindowSpecDefinition('a :: Nil,
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
```

after multiple iterations of the two rules `ColumnPruning` and `PushPredicateThroughProject`, the optimized plan we generate is:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
```

However, the expected optimized plan should be:

```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
```

How was this patch tested?

The existing test cases already expose the problem, but we need to add more regression tests to ensure that future code changes do not break it. TODO: add more test cases.
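The mutual reversal described above can be caricatured with two toy rules that swap a `Filter`/`Project` pair in opposite directions. This is a deliberately simplified, hypothetical model (not Catalyst code; `pushPredicate`, `pullProject`, and `run` are invented names): when both rules live in the same fixed-point batch, they cancel each other within a single pass, so the loop stops with no net progress and the plan is never improved:

```scala
// Caricature of two optimizer rules that reverse each other's effect.

// Pushes a Filter below an adjacent Project (operators listed top-down).
def pushPredicate(ops: List[String]): List[String] = ops match {
  case "Filter" :: "Project" :: rest => "Project" :: "Filter" :: rest
  case other => other
}

// Pulls the Project back below the Filter, undoing the rule above.
def pullProject(ops: List[String]): List[String] = ops match {
  case "Project" :: "Filter" :: rest => "Filter" :: "Project" :: rest
  case other => other
}

// Run the rules in one batch until the plan stops changing (with a cap,
// analogous to an executor's max-iteration limit).
def run(rules: Seq[List[String] => List[String]],
        start: List[String], maxIter: Int): (List[String], Int) = {
  var cur = start
  var iters = 0
  var changed = true
  while (changed && iters < maxIter) {
    val next = rules.foldLeft(cur)((acc, r) => r(acc))
    changed = next != cur
    cur = next
    iters += 1
  }
  (cur, iters)
}

// The pair cancels out within a single pass: the batch terminates
// immediately, leaving the plan exactly as it started.
val (plan, iters) = run(Seq(pushPredicate, pullProject),
  List("Filter", "Project", "Scan"), maxIter = 10)
```

This mirrors the PR description: the iteration limit is not hit, but the conflicting rules leave the plan short of its best form.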
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark predicatePushDownOverColumnPruning

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11745.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11745

commit c6221a4b8985ff92c425899fd48a3845ec73eb38
Author: gatorsmile
Date: 2016-03-15T19:06:32Z
Merge remote-tracking branch 'upstream/master' into predicatePushDownOverColumnPruning

commit c21748aa5b3c08d25d878421f1465b9ea4e20371
Author: gatorsmile
Date: 2016-03-16T00:06:03Z
address the conflicts of two rules: PushPredicateThroughProject and PushProjectThroughFilter.