[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11745


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197752404
  
**[Test build #53400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53400/consoleFull)** for PR 11745 at commit [`b853e01`](https://github.com/apache/spark/commit/b853e01a281c575ccbf5e312840b749fd479fe69).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197730305
  
**[Test build #53400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53400/consoleFull)** for PR 11745 at commit [`b853e01`](https://github.com/apache/spark/commit/b853e01a281c575ccbf5e312840b749fd479fe69).





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197752855
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56532685
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *   - Aggregate
- *   - Generate
- *   - Project <- Join
- *   - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
--- End diff --

Yeah, I have the same concern. This PR is just to resolve the conflicts 
based on the current infrastructure. 

In my opinion, in each batch we need a few rule sets. The order of rule 
sets does not matter, but within each rule set, the order of rules matters. However, this 
is a fundamental design change. @marmbrus @rxin might have a better idea on 
this. 
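
The batch-and-rule-set idea above can be sketched with a toy rule executor. This is a hypothetical model, not Catalyst code; `RuleExecutorSketch`, `Batch`, and the rule below are invented names for illustration only.

```scala
// Toy model: inside a batch, rules run in a fixed order, and the batch
// repeats until the plan stops changing (a fixed point) or maxIterations is hit.
object RuleExecutorSketch {
  type Plan = List[String] // stand-in for a logical plan: one operator per entry

  trait Rule {
    def name: String
    def apply(plan: Plan): Plan
  }

  case class Batch(name: String, maxIterations: Int, rules: Rule*) {
    def execute(plan: Plan): Plan = {
      var current = plan
      var iterations = 0
      var changed = true
      while (changed && iterations < maxIterations) {
        // Within one pass, order matters: each rule sees the previous rule's output.
        val next = rules.foldLeft(current)((p, rule) => rule(p))
        changed = next != current
        current = next
        iterations += 1
      }
      current
    }
  }
}
```

Under the proposal in the comment above, each batch would instead hold ordered rule sets, with the sets themselves unordered; the loop would then iterate over sets rather than individual rules.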





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197752864
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53400/
Test PASSed.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56463073
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *   - Aggregate
- *   - Generate
- *   - Project <- Join
- *   - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
--- End diff --

I'm a little against depending on rule order too much. Sometimes we 
have to, as other solutions are way too complex, but for this issue, can we try 
to find a more general solution?





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-18 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56705349
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *   - Aggregate
- *   - Generate
- *   - Project <- Join
- *   - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
+ * pushed as low as possible.
+ */
+object PushProjectThroughFilter extends Rule[LogicalPlan] {
--- End diff --

We do not actually push Project through Filter; we create a new Project 
before it to prune some columns.

As I said in another PR, we remove those Projects before the Filter.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-18 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56744011
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
--- End diff --

oh, I missed that, sorry.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56743924
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
--- End diff --

I did not remove it from the original batch; I just added the extra batch 
here.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-18 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56743896
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -91,6 +92,10 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
       SimplifyCasts,
       SimplifyCaseConversionExpressions,
       EliminateSerialization) ::
+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::
--- End diff --

Putting this rule in a separate batch is not correct; some other filter 
push-down rules depend on this one.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11745#discussion_r56743788
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -306,14 +311,28 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 }
 
 /**
- * Attempts to eliminate the reading of unneeded columns from the query plan using the following
- * transformations:
+ * Attempts to eliminate the reading of unneeded columns from the query plan
+ * by pushing Project through Filter.
  *
- *  - Inserting Projections beneath the following operators:
- *   - Aggregate
- *   - Generate
- *   - Project <- Join
- *   - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be
+ * pushed as low as possible.
+ */
+object PushProjectThroughFilter extends Rule[LogicalPlan] {
--- End diff --

@davies 

The naming of this rule is not right, but I still think this PR fixes the 
fundamental conflict between `ColumnPruning` and 
`PushPredicateThroughProject`. If we do not take the approach of this PR, I can 
find a test case showing that the minor fix in `ColumnPruning` does not cover all 
the cases. 





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197112611
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197112616
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53248/
Test PASSed.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197112418
  
**[Test build #53248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53248/consoleFull)** for PR 11745 at commit [`c21748a`](https://github.com/apache/spark/commit/c21748aa5b3c08d25d878421f1465b9ea4e20371).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11745#issuecomment-197082803
  
**[Test build #53248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53248/consoleFull)** for PR 11745 at commit [`c21748a`](https://github.com/apache/spark/commit/c21748aa5b3c08d25d878421f1465b9ea4e20371).





[GitHub] spark pull request: [SPARK-13919] [SQL] [WIP] Resolving the Confli...

2016-03-15 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/11745

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and 
PushPredicateThroughProject

## What changes were proposed in this pull request?

Currently, `ColumnPruning` and `PushPredicateThroughProject` reverse each other's 
effects. Although this does not hit the maximum number of optimizer iterations now, some queries are not 
optimized as well as they could be. 

For example, in the following query, 
```scala
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
    WindowExpression(
      AggregateExpression(Count('b), Complete, isDistinct = false),
      WindowSpecDefinition('a :: Nil,
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)
```

After multiple iterations of the two rules `ColumnPruning` and 
`PushPredicateThroughProject`, the optimized plan we generate looks like:
```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
```

However, the expected optimized plan should be like:
```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
```
## How was this patch tested?

The existing test cases already expose the problem, but we need to add more 
regression tests to ensure that future code changes will not break it. 

TODO: add more test cases.
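
The ping-pong between the two rules described above can be modeled with a toy plan algebra. This is a sketch only, not Catalyst code; the case classes and rule functions below are invented for illustration.

```scala
// Toy model of the conflict: pushFilter moves a Filter below a Project
// (like PushPredicateThroughProject), while pruneColumns inserts a new
// pruning Project below a Filter (like ColumnPruning). Run in a loop,
// each rewrite re-creates the pattern the other one fires on.
sealed trait Plan
case class Relation(cols: Set[String]) extends Plan
case class Project(cols: Set[String], child: Plan) extends Plan
case class Filter(cond: String, refs: Set[String], child: Plan) extends Plan

// Push a Filter below a Project when the condition only uses projected columns.
def pushFilter(p: Plan): Plan = p match {
  case Filter(c, refs, Project(cols, child)) if refs.subsetOf(cols) =>
    Project(cols, Filter(c, refs, pushFilter(child)))
  case Project(cols, child)   => Project(cols, pushFilter(child))
  case Filter(c, refs, child) => Filter(c, refs, pushFilter(child))
  case other                  => other
}

// Naive column pruning: under a Filter, keep only the columns needed above it
// plus those the condition references -- re-creating a Project below the
// Filter and undoing the rewrite above.
def pruneColumns(needed: Set[String], p: Plan): Plan = p match {
  case Filter(c, refs, child) =>
    Filter(c, refs, Project(needed ++ refs, pruneColumns(needed ++ refs, child)))
  case Project(cols, child) => Project(cols, pruneColumns(cols, child))
  case other                => other
}
```

Applying `pushFilter` and then `pruneColumns` to a Filter-over-Project plan puts a Project back under the Filter, so neither rewrite alone reaches a stable plan unless adjacent Projects are collapsed or the rule order is pinned, which is the conflict this PR works around.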

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark predicatePushDownOverColumnPruning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11745


commit c6221a4b8985ff92c425899fd48a3845ec73eb38
Author: gatorsmile 
Date:   2016-03-15T19:06:32Z

Merge remote-tracking branch 'upstream/master' into predicatePushDownOverColumnPruning

commit c21748aa5b3c08d25d878421f1465b9ea4e20371
Author: gatorsmile 
Date:   2016-03-16T00:06:03Z

address the conflicts of two rules: PushPredicateThroughProject and PushProjectThroughFilter.



