[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2018-07-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18918


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-11-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18918#discussion_r151533419
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2029,4 +2029,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
       Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-21707: nondeterministic expressions correctly for filter predicates") {
+    withTempPath { path =>
+      val p = path.getAbsolutePath
+      Seq(1 -> "a").toDF("a", "b").write.partitionBy("a").parquet(p)
+      val df = spark.read.parquet(p)
+      checkAnswer(df.filter(rand(10) <= 1.0).select($"a"), Row(1))
--- End diff --

This test already passes on current master.


---




[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-25 Thread heary-cao
GitHub user heary-cao reopened a pull request:

https://github.com/apache/spark/pull/18918

[SPARK-21707][SQL]Improvement a special case for non-deterministic filters in optimizer

## What changes were proposed in this pull request?

Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic filters, so that only the fields the query actually needs are read when a filter is non-deterministic.
For example, when the filter condition is non-deterministic (e.g. it contains a non-deterministic function such as `rand`), HiveTableScans generates the following plans:

```
HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) AS sum(id)#395L]
+- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
   +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
      +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
   +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
+- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:MetastoreRelation XXX_database, XXX_table

HiveTableScans result plan:HiveTableScan [c030#204L, d004#205, d005#206, d025#207, c002#208, d023#209, d024#210, c005#211L, c008#212, c009#213, c010#214, d021#215, d022#216, c017#217, c018#218, c019#219, c020#220, c021#221, c022#222, c023#223, c024#224, c025#225, c026#226, c027#227, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

```
So HiveTableScan reads all the fields from the table, although only `d004` and `c010` are needed. This hurts the performance of the task.
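
To make the expectation concrete, here is a minimal, self-contained sketch of which columns a scan has to read. It is a toy model, not Catalyst: `PruningSketch`, `Relation`, `Project`, `Filter` and `requiredAtScan` are hypothetical names introduced only for illustration, and the column list is shortened.

```scala
// Toy model of column pruning. These types are illustrative assumptions,
// not Spark's LogicalPlan classes.
object PruningSketch {
  sealed trait Plan
  // A leaf scan that can produce any of `columns`.
  final case class Relation(name: String, columns: Set[String]) extends Plan
  // A projection whose expressions read `references` from its child.
  final case class Project(references: Set[String], child: Plan) extends Plan
  // A filter whose predicate reads `references`. `deterministic` is carried
  // along but never consulted: it does not change which columns are read.
  final case class Filter(references: Set[String], deterministic: Boolean, child: Plan)
      extends Plan

  // Columns the scan at the bottom must read.
  // `needed = None` means "no operator above has narrowed the output yet".
  def requiredAtScan(plan: Plan, needed: Option[Set[String]] = None): Set[String] =
    plan match {
      case Relation(_, columns) =>
        needed.getOrElse(columns).intersect(columns)
      case Project(refs, child) =>
        // Only the columns the projection reads are needed below it.
        requiredAtScan(child, Some(refs))
      case Filter(refs, _, child) =>
        // The filter adds its own references to whatever is needed above.
        requiredAtScan(child, needed.map(_ ++ refs))
    }

  def main(args: Array[String]): Unit = {
    val table = Relation("XXX_table", Set("c030", "d004", "d005", "c010"))
    val plan =
      Project(Set("d004", "c010"), Filter(Set("d004"), deterministic = false, table))
    println(requiredAtScan(plan))
  }
}
```

In this model the scan under the `Project` only has to read `d004` and `c010`, whether or not the filter is deterministic. A bare `Filter` directly over the relation still needs every column, which corresponds to the plan shape shown above.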

## How was this patch tested?

Covered by existing test cases, plus newly added test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heary-cao/spark filters_non_deterministic

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18918.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18918


commit 97a32709f40c573bada4c46df0d00aad14425ee2
Author: caoxuewen 
Date:   2017-08-11T09:56:55Z

Improvement a special case for non-deterministic filters in optimizer




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-22 Thread heary-cao
Github user heary-cao closed the pull request at:

https://github.com/apache/spark/pull/18918





[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18918#discussion_r132921051
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -522,6 +522,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
    * so remove it.
    */
   private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
+    case p1 @ Project(_, _ @ Filter(condition, _ @ Project(_, _: LeafNode)))
+      if !condition.deterministic => p1
--- End diff --

I don't follow your explanation. If I understand correctly, when a `Project` selects a subset of the `LeafNode`'s output and we remove it with the pattern below, we will retrieve all fields. Is that your purpose?
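
For readers following along, the behavior in question can be sketched on a toy plan model. This is not the Catalyst code: `RuleSketch` and its `Leaf`, `Project` and `Filter` classes are hypothetical, the rule only looks at the top of the plan instead of using `transform`, and the second case is an assumed simplification of the existing removal logic.

```scala
// Toy model of the guard added in the diff. All types here are illustrative
// assumptions, not Spark's LogicalPlan classes.
object RuleSketch {
  sealed trait Plan { def output: Set[String] }
  final case class Leaf(output: Set[String]) extends Plan
  final case class Project(output: Set[String], child: Plan) extends Plan
  final case class Filter(deterministic: Boolean, child: Plan) extends Plan {
    // A filter passes through whatever its child produces.
    def output: Set[String] = child.output
  }

  // Inspects only the top of the plan (no recursive transform).
  def removeProjectBeforeFilter(plan: Plan): Plan = plan match {
    // Guard mirroring the diff: with a non-deterministic filter, keep the
    // inner Project over the leaf so the leaf's output stays narrowed.
    case p1 @ Project(_, Filter(false, Project(_, _: Leaf))) => p1
    // Assumed shape of the existing case: drop an inner Project that only
    // selects a subset of its child's output.
    case Project(out, Filter(det, Project(inner, child)))
        if inner.subsetOf(child.output) =>
      Project(out, Filter(det, child))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val leaf = Leaf(Set("a", "b", "c"))
    val plan = Project(Set("a"), Filter(deterministic = false, Project(Set("a", "b"), leaf)))
    // The inner Project survives, so the filter still sees only a and b.
    println(removeProjectBeforeFilter(plan))
  }
}
```

With `deterministic = true` the same input loses its inner `Project`, and the filter then sits directly on a leaf that outputs all three columns.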





[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18918#discussion_r132919858
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala ---
@@ -360,5 +360,34 @@ class ColumnPruningSuite extends PlanTest {
     comparePlans(optimized2, expected2.analyze)
   }
 
+  test("SPARK-21707 the condition of filter is not deterministic that split to two project ") {
--- End diff --

Actually, I don't get what the test title is trying to say. Can you rephrase it?





[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-14 Thread heary-cao
Github user heary-cao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18918#discussion_r132902343
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -522,6 +522,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
    * so remove it.
    */
   private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
+    case p1 @ Project(_, _ @ Filter(condition, _ @ Project(_, _: LeafNode)))
+      if !condition.deterministic => p1
--- End diff --

When we split a child, we prune it here again.






[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...

2017-08-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18918#discussion_r132891196
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -522,6 +522,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
    * so remove it.
    */
   private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
+    case p1 @ Project(_, _ @ Filter(condition, _ @ Project(_, _: LeafNode)))
+      if !condition.deterministic => p1
--- End diff --

Why can't we remove the `Project` if the condition is not deterministic?

