[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/12954


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217605162
  
LGTM, 
Merging this into master and 2.0 branch, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217605013
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217605015
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58047/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217604963
  
**[Test build #58047 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58047/consoleFull)**
 for PR 12954 at commit 
[`e28bbb6`](https://github.com/apache/spark/commit/e28bbb6c90aef0a17caab5db8072327fcf93e59d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217600276
  
**[Test build #58047 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58047/consoleFull)**
 for PR 12954 at commit 
[`e28bbb6`](https://github.com/apache/spark/commit/e28bbb6c90aef0a17caab5db8072327fcf93e59d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62409983
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/BooleanSimplification.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{And, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Not, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.expressions.Literal._
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Simplifies boolean expressions:
+ * 1. Simplifies expressions whose answer can be determined without 
evaluating both sides.
+ * 2. Eliminates / extracts common factors.
+ * 3. Merge same expressions
+ * 4. Removes `Not` operator.
+ */
+object BooleanSimplification extends Rule[LogicalPlan] with 
PredicateHelper {
--- End diff --

import it into analyzer as temporary fix, we could think of a proper way 
later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62409870
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/BooleanSimplification.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{And, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Not, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.expressions.Literal._
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Simplifies boolean expressions:
+ * 1. Simplifies expressions whose answer can be determined without 
evaluating both sides.
+ * 2. Eliminates / extracts common factors.
+ * 3. Merge same expressions
+ * 4. Removes `Not` operator.
+ */
+object BooleanSimplification extends Rule[LogicalPlan] with 
PredicateHelper {
--- End diff --

Ok, what do you suggest then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62409566
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/BooleanSimplification.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{And, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Not, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.expressions.Literal._
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Simplifies boolean expressions:
+ * 1. Simplifies expressions whose answer can be determined without 
evaluating both sides.
+ * 2. Eliminates / extracts common factors.
+ * 3. Merge same expressions
+ * 4. Removes `Not` operator.
+ */
+object BooleanSimplification extends Rule[LogicalPlan] with 
PredicateHelper {
--- End diff --

Moving this one into analysis is even worse ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217580900
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58031/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217580898
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217580745
  
**[Test build #58031 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58031/consoleFull)**
 for PR 12954 at commit 
[`b487925`](https://github.com/apache/spark/commit/b487925dea798a2ae624242f3d9a18aa84f0b6fc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217560774
  
**[Test build #58031 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58031/consoleFull)**
 for PR 12954 at commit 
[`b487925`](https://github.com/apache/spark/commit/b487925dea798a2ae624242f3d9a18aa84f0b6fc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62391466
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -26,6 +26,7 @@ import 
org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
 import org.apache.spark.sql.catalyst.encoders.OuterScopes
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
--- End diff --

I moved the rule in the latest commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62391407
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1645,3 +1646,30 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] {
   }
   }
 }
+
+/**
+ * Rewrite [[Filter]] plans that contain correlated [[ScalarSubquery]] 
expressions. When these
+ * correlated [[ScalarSubquery]] expressions are wrapped in a some 
Predicate expression, we rewrite
+ * them into [[PredicateSubquery]] expressions.
+ */
+object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] {
--- End diff --

The advantage of Semi Join is that a subquery can actually return multiple 
results for one row without causing correctness problems. My initial approach 
was to relax the rules for scalar subquery (allow disjunctive predicates) and 
prevent possible duplicates by using left semis. This didn't work because I was 
also pulling non-correlated predicates through the aggregate (which makes its 
results invalid).

I am not sure which predicates you want to push through the left semi. 
Since all predicates that should be pushed down the left hand side are already 
in the predicate condition. But I might be missing something here.

I have remove this in my last commit. I do feel that this might be a small 
improvement over the current situation. Let's revisit this after Spark 2.0.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62388521
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1645,3 +1646,30 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] {
   }
   }
 }
+
+/**
+ * Rewrite [[Filter]] plans that contain correlated [[ScalarSubquery]] 
expressions. When these
+ * correlated [[ScalarSubquery]] expressions are wrapped in a some 
Predicate expression, we rewrite
+ * them into [[PredicateSubquery]] expressions.
+ */
+object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] {
--- End diff --

With inner join, the predicate could be pushed down into subquery actually, 
but it was not with semi join. So it's not that obvious that which one is 
better.

After looking it for a long time, I'm not convinced it's correct ( or the 
best way to do it). Could you keep it out of this PR for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62382220
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1645,3 +1646,30 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] {
   }
   }
 }
+
+/**
+ * Rewrite [[Filter]] plans that contain correlated [[ScalarSubquery]] 
expressions. When these
+ * correlated [[ScalarSubquery]] expressions are wrapped in a some 
Predicate expression, we rewrite
+ * them into [[PredicateSubquery]] expressions.
+ */
+object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] {
--- End diff --

I'll remove it if you want to.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62382186
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1645,3 +1646,30 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] {
   }
   }
 }
+
+/**
+ * Rewrite [[Filter]] plans that contain correlated [[ScalarSubquery]] 
expressions. When these
+ * correlated [[ScalarSubquery]] expressions are wrapped in a some 
Predicate expression, we rewrite
+ * them into [[PredicateSubquery]] expressions.
+ */
+object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] {
--- End diff --

It is not nessecary for this fix. It is left over from the initial attempt. 
I do think it adds value, since it allows us to use queries which might return 
multiple values for one key.

We do have a rule that promotes and outer join into an inner join, but we 
do not have one that turns a outer/inner join into a semi join.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62377447
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1645,3 +1646,30 @@ object RewriteCorrelatedScalarSubquery extends 
Rule[LogicalPlan] {
   }
   }
 }
+
+/**
+ * Rewrite [[Filter]] plans that contain correlated [[ScalarSubquery]] 
expressions. When these
+ * correlated [[ScalarSubquery]] expressions are wrapped in a some 
Predicate expression, we rewrite
+ * them into [[PredicateSubquery]] expressions.
+ */
+object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] {
--- End diff --

This rule is not necessary for this fix, right?

Also, we have a rule to turn outer join into semi join if possible, does 
that one not work in case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217533079
  
@hvanhovell never mind, I made a mistake in the query


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217527040
  
I still can't run Q41 after this PR:
```
 select distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
from item
where (i_manufact = i1.i_manufact and
(i_category = 'Women' and
(i_color = 'powder' or i_color = 'khaki') and
(i_units = 'Ounce' or i_units = 'Oz') and
(i_size = 'medium' or i_size = 'extra large')
) or
(i_category = 'Women' and
(i_color = 'brown' or i_color = 'honeydew') and
(i_units = 'Bunch' or i_units = 'Ton') and
(i_size = 'N/A' or i_size = 'small')
) or
(i_category = 'Men' and
(i_color = 'floral' or i_color = 'deep') and
(i_units = 'N/A' or i_units = 'Dozen') and
(i_size = 'petite' or i_size = 'large')
) or
(i_category = 'Men' and
(i_color = 'light' or i_color = 'cornflower') and
(i_units = 'Box' or i_units = 'Pound') and
(i_size = 'medium' or i_size = 'extra large')
))
or
(i_manufact = i1.i_manufact and
((i_category = 'Women' and
(i_color = 'midnight' or i_color = 'snow') and
(i_units = 'Pallet' or i_units = 'Gross') and
(i_size = 'medium' or i_size = 'extra large')
) or
(i_category = 'Women' and
(i_color = 'cyan' or i_color = 'papaya') and
(i_units = 'Cup' or i_units = 'Dram') and
(i_size = 'N/A' or i_size = 'small')
) or
(i_category = 'Men' and
(i_color = 'orange' or i_color = 'frosted') and
(i_units = 'Each' or i_units = 'Tbl') and
(i_size = 'petite' or i_size = 'large')
) or
(i_category = 'Men' and
(i_color = 'forest' or i_color = 'ghost') and
(i_units = 'Lb' or i_units = 'Bundle') and
(i_size = 'medium' or i_size = 'extra large')
 > 0
 order by i_product_name
 limit 100


Traceback (most recent call last):
  File "tpcds.py", line 117, in 
test(sys.argv[1], 2)
  File "tpcds.py", line 83, in test
sqlContext.sql(q).explain(False)
  File "/Users/davies/work/spark/python/pyspark/sql/context.py", line 345, 
in sql
return self.sparkSession.sql(sqlQuery)
  File "/Users/davies/work/spark/python/pyspark/sql/session.py", line 505, 
in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "//anaconda/lib/python2.7/site-packages/py4j/java_gateway.py", line 
538, in __call__
self.target_id, self.name)
  File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 63, in 
deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'The correlated scalar subquery can 
only contain equality predicates: i_category#215#484 = Women) && 
(i_manufact#217#486 = i_manufact#217) && ((i_color#220#485 = powder) || 
(i_color#220#485 = khaki))) && ((i_units#221#488 = Ounce) || (i_units#221#488 = 
Oz))) && ((i_size#218#487 = medium) || (i_size#218#487 = extra large))) || 
i_color#220#485 = brown) || (i_color#220#485 = honeydew)) && 
((i_units#221#488 = Bunch) || (i_units#221#488 = Ton))) && ((i_size#218#487 = 
N/A) || (i_size#218#487 = small) || ((i_category#215#484 = Men) && 
(i_color#220#485 = floral) || (i_color#220#485 = deep)) && 
((i_units#221#488 = N/A) || (i_units#221#488 = Dozen))) && ((i_size#218#487 = 
petite) || (i_size#218#487 = large))) || i_color#220#485 = light) || 
(i_color#220#485 = cornflower)) && ((i_units#221#488 = Box) || (i_units#221#488 
= Pound))) && ((i_size#218#487 = medium) || (i_size#218#487 = extra large)) 
|| ((i_manufact#217#486 = i_manu
 fact#217) && (((i_category#215#484 = Women) && (i_color#220#485 = 
midnight) || (i_color#220#485 = snow)) && ((i_units#221#488 = Pallet) || 
(i_units#221#488 = Gross))) && ((i_size#218#487 = medium) || (i_size#218#487 = 
extra large))) || i_color#220#485 = cyan) || (i_color#220#485 = papaya)) && 
((i_units#221#488 = Cup) || (i_units#221#488 = Dram))) && ((i_size#218#487 = 
N/A) || (i_size#218#487 = small) || ((i_category#215#484 = Men) && 
(i_color#220#485 = orange) || (i_color#220#485 = frosted)) && 
((i_units#221#488 = Each) || (i_units#221#488 = Tbl))) && ((i_size#218#487 = 
petite) || (i_size#218#487 = large))) || i_color#220#485 = forest) || 
(i_color#220#485 = ghost)) && ((i_units#221#488 = Lb) || (i_units#221#488 = 
Bundle))) && ((i_size#218#487 = medium) || (i_size#218#487 = extra 
large;'
```


---
If your project is set up for it, you can reply to this email and have your
reply appear 

[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62357864
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -26,6 +26,7 @@ import 
org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
 import org.apache.spark.sql.catalyst.encoders.OuterScopes
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
--- End diff --

I totally agree on that. I could move the rule from `optimizer` to 
`analysis`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12954#discussion_r62357413
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -26,6 +26,7 @@ import 
org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
 import org.apache.spark.sql.catalyst.encoders.OuterScopes
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
--- End diff --

It weird to use optimizer rule in analyzer. We could have this workaround 
for now, could you create a JIRA for do a proper fix later?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217472088
  
**[Test build #57994 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57994/consoleFull)**
 for PR 12954 at commit 
[`f0871c9`](https://github.com/apache/spark/commit/f0871c921285a05602cf566c9f2c23901224d73e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217472331
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57994/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217472325
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217447455
  
**[Test build #57994 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57994/consoleFull)**
 for PR 12954 at commit 
[`f0871c9`](https://github.com/apache/spark/commit/f0871c921285a05602cf566c9f2c23901224d73e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
Github user hvanhovell commented on the pull request:

https://github.com/apache/spark/pull/12954#issuecomment-217447302
  
cc @rxin @davies


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize p...

2016-05-06 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/12954

[SPARK-15122][SQL] Fix TPC-DS 41 - Normalize predicates before pulling them 
out

## What changes were proposed in this pull request?
The official TPC-DS 41 query currently fails because it contains a scalar 
subquery with a disjunctive correlated predicate (the correlated predicates 
were nested in ORs). This makes the `Analyzer` pull out the entire predicate 
which is wrong and causes the following (correct) analysis exception: `The 
correlated scalar subquery can only contain equality predicates`

This PR fixes this by first simplifing (or normalizing) the correlated 
predicates before pulling them out of the subquery. I have also added a small 
optimizer rule that rewrites correlated scalar subqueries into predicate 
subqueries if they are used in a `Filter` and are wrapped by a predicate. This 
is allows us to use semi joins instead of left outer joins.

## How was this patch tested?
Manual testing on TPC-DS 41, and added a test to SubquerySuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-15122

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12954


commit f0871c921285a05602cf566c9f2c23901224d73e
Author: Herman van Hovell 
Date:   2016-05-06T13:39:43Z

Fix TPC-DS 41 - normalize predicates before pulling them out.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org