spark git commit: [SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions

2018-01-08 Thread lixiao

Repository: spark
Updated Branches:
  refs/heads/branch-2.2 41f705a57 -> 7c30ae39f


[SPARK-22983] Don't push filters beneath aggregates with empty grouping 
expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually 
returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters
a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes
the original filter and adds a new filter onto the Aggregate's child, e.g.
`Agg(Filter(...))`. This is sometimes okay, but the case above is a
counterexample. Because there is no explicit `GROUP BY`, we are implicitly
computing a global aggregate over the entire table, and a global aggregate
emits exactly one row even when its input is empty. The original filter was
therefore not acting like a `HAVING` clause that filters groups: once pushed
beneath the Aggregate, it can no longer reduce the cardinality of the
Aggregate output, leading to the wrong answer.
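To make the cardinality argument concrete, here is a minimal sketch that models the query above with plain Scala collections rather than Spark. The object name, `globalAggregate`, `outerCondition`, and `pushedCondition` are hypothetical helpers invented for this illustration, and the model assumes only that a global aggregate always emits one row.

```scala
object GlobalAggregatePushdownDemo {
  // One output row of the inner query: (z, MIN(x)).
  // MIN over zero rows is NULL, modeled here as None.
  type AggRow = (Int, Option[Int])

  // Toy global aggregate (no GROUP BY): always yields exactly one row,
  // even when the input is empty.
  def globalAggregate(xs: Seq[Int]): Seq[AggRow] =
    Seq((1, xs.reduceOption(_ min _)))

  // The outer predicate `b.z != b.z`, evaluated on the aggregate's output.
  def outerCondition(z: Int): Boolean = z != z

  // The same predicate after `z` is replaced by its aliased literal 1,
  // which is what lets it be evaluated against the aggregate's input rows.
  def pushedCondition(x: Int): Boolean = 1 != 1

  val a: Seq[Int] = Seq(1)                         // FROM (select 1 as x) a
  val innerInput: Seq[Int] = a.filter(_ => false)  // WHERE false

  // Filter(Agg(...)): the filter removes the single aggregate row.
  val filterAboveAgg: Seq[AggRow] =
    globalAggregate(innerInput).filter { case (z, _) => outerCondition(z) }

  // Agg(Filter(...)): the filter only empties the input; the global
  // aggregate still emits its one row.
  val aggAboveFilter: Seq[AggRow] =
    globalAggregate(innerInput.filter(pushedCondition))

  def main(args: Array[String]): Unit = {
    println(s"Filter above Aggregate: ${filterAboveAgg.size} row(s)")
    println(s"Filter below Aggregate: ${aggAboveFilter.size} row(s)")
  }
}
```

In this toy model the filter above the aggregate leaves zero rows, while the pushed-down form leaves one row, which matches the discrepancy described above.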

In 2016 I fixed a similar problem involving invalid pushdowns of 
data-independent filters (filters which reference no columns of the filtered 
relation). There was additional discussion after my fix was merged which 
pointed out that my patch was an incomplete fix (see #15289), but it looks like I
must have either misunderstood the comment or forgotten to follow up on the
additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases 
where there are no grouping expressions. Since there are no grouping keys, the 
only columns are aggregate columns and we can't push filters defined over 
aggregate results, so this change won't cause us to miss out on any legitimate 
pushdown opportunities.
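The shape of the guard can be sketched on a toy plan tree. The case classes below are simplified stand-ins written for this illustration, not Catalyst's actual `Filter` and `Aggregate` classes, and the sketch assumes the whole condition references only grouping columns (the real rule splits the condition and pushes just the eligible part).

```scala
object EmptyGroupingGuardDemo {
  // Hypothetical, heavily simplified logical plan nodes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  case class Aggregate(
      groupingExpressions: Seq[String],
      aggregateExpressions: Seq[String],
      child: Plan) extends Plan

  // Push a filter beneath an aggregate only when the aggregate has at
  // least one grouping expression; otherwise leave the plan untouched.
  def pushDown(plan: Plan): Plan = plan match {
    case Filter(condition, agg: Aggregate) if agg.groupingExpressions.nonEmpty =>
      agg.copy(child = Filter(condition, agg.child))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val grouped = Filter("k > 0", Aggregate(Seq("k"), Seq("k", "min(x)"), Relation("t")))
    val global = Filter("z != z", Aggregate(Nil, Seq("1 AS z", "min(x)"), Relation("t")))
    println(pushDown(grouped)) // filter moves beneath the aggregate
    println(pushDown(global))  // plan is returned unchanged
  }
}
```

With a non-empty grouping list the filter still moves beneath the aggregate, so grouped queries keep their pushdown; with an empty list the plan is left as `Filter(Agg(...))`.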

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen 

Closes #20180 from 
JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a948bdde798aaf0f87c18846281deb05fd)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c30ae39
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c30ae39
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c30ae39

Branch: refs/heads/branch-2.2
Commit: 7c30ae39f57ef0c42173b52aa405027b44e0ad9f
Parents: 41f705a
Author: Josh Rosen 
Authored: Mon Jan 8 16:04:03 2018 +0800
Committer: gatorsmile 
Committed: Mon Jan 8 16:05:04 2018 +0800

--
 .../spark/sql/catalyst/optimizer/Optimizer.scala|  3 ++-
 .../catalyst/optimizer/FilterPushdownSuite.scala| 13 +
 .../test/resources/sql-tests/inputs/group-by.sql|  9 +
 .../resources/sql-tests/results/group-by.sql.out| 16 +++-
 4 files changed, 39 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7c30ae39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 82bd759..fe66821 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -754,7 +754,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
         project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
 
     case filter @ Filter(condition, aggregate: Aggregate)
-      if aggregate.aggregateExpressions.forall(_.deterministic) =>
+      if aggregate.aggregateExpressions.forall(_.deterministic)
+        && aggregate.groupingExpressions.nonEmpty =>
       // Find all the aliased expressions in the aggregate list that don't include any actual
       // AggregateExpression, and create a map from the alias to the expression
       val aliasMap = AttributeMap(aggregate.aggregateExpressions.collect {

http://git-wip-us.apache.org/repos/asf/spark/blob/7c30ae39/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
--
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index d4d281e..4d41354 100644
---

spark git commit: [SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions

2018-01-08 Thread lixiao

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 8bf24e9fe -> 6964dfe47


[SPARK-22983] Don't push filters beneath aggregates with empty grouping 
expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually 
returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters
a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes
the original filter and adds a new filter onto the Aggregate's child, e.g.
`Agg(Filter(...))`. This is sometimes okay, but the case above is a
counterexample. Because there is no explicit `GROUP BY`, we are implicitly
computing a global aggregate over the entire table, and a global aggregate
emits exactly one row even when its input is empty. The original filter was
therefore not acting like a `HAVING` clause that filters groups: once pushed
beneath the Aggregate, it can no longer reduce the cardinality of the
Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of 
data-independent filters (filters which reference no columns of the filtered 
relation). There was additional discussion after my fix was merged which 
pointed out that my patch was an incomplete fix (see #15289), but it looks like I
must have either misunderstood the comment or forgotten to follow up on the
additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases 
where there are no grouping expressions. Since there are no grouping keys, the 
only columns are aggregate columns and we can't push filters defined over 
aggregate results, so this change won't cause us to miss out on any legitimate 
pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen 

Closes #20180 from 
JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a948bdde798aaf0f87c18846281deb05fd)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6964dfe4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6964dfe4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6964dfe4

Branch: refs/heads/branch-2.3
Commit: 6964dfe47b2090e542b26cd64e27420ec3eb1a3d
Parents: 8bf24e9
Author: Josh Rosen 
Authored: Mon Jan 8 16:04:03 2018 +0800
Committer: gatorsmile 
Committed: Mon Jan 8 16:04:28 2018 +0800

--
 .../spark/sql/catalyst/optimizer/Optimizer.scala|  3 ++-
 .../catalyst/optimizer/FilterPushdownSuite.scala| 13 +
 .../test/resources/sql-tests/inputs/group-by.sql|  9 +
 .../resources/sql-tests/results/group-by.sql.out| 16 +++-
 4 files changed, 39 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6964dfe4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 0d4b02c..df0af82 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -795,7 +795,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
         project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
 
     case filter @ Filter(condition, aggregate: Aggregate)
-      if aggregate.aggregateExpressions.forall(_.deterministic) =>
+      if aggregate.aggregateExpressions.forall(_.deterministic)
+        && aggregate.groupingExpressions.nonEmpty =>
       // Find all the aliased expressions in the aggregate list that don't include any actual
       // AggregateExpression, and create a map from the alias to the expression
       val aliasMap = AttributeMap(aggregate.aggregateExpressions.collect {

http://git-wip-us.apache.org/repos/asf/spark/blob/6964dfe4/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
--
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index 85a5e97..82a1025 100644
---

spark git commit: [SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions

2018-01-08 Thread lixiao

Repository: spark
Updated Branches:
  refs/heads/master 8fdeb4b99 -> 2c73d2a94


[SPARK-22983] Don't push filters beneath aggregates with empty grouping 
expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually 
returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters
a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes
the original filter and adds a new filter onto the Aggregate's child, e.g.
`Agg(Filter(...))`. This is sometimes okay, but the case above is a
counterexample. Because there is no explicit `GROUP BY`, we are implicitly
computing a global aggregate over the entire table, and a global aggregate
emits exactly one row even when its input is empty. The original filter was
therefore not acting like a `HAVING` clause that filters groups: once pushed
beneath the Aggregate, it can no longer reduce the cardinality of the
Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of 
data-independent filters (filters which reference no columns of the filtered 
relation). There was additional discussion after my fix was merged which 
pointed out that my patch was an incomplete fix (see #15289), but it looks like I
must have either misunderstood the comment or forgotten to follow up on the
additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases 
where there are no grouping expressions. Since there are no grouping keys, the 
only columns are aggregate columns and we can't push filters defined over 
aggregate results, so this change won't cause us to miss out on any legitimate 
pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen 

Closes #20180 from 
JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c73d2a9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c73d2a9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c73d2a9

Branch: refs/heads/master
Commit: 2c73d2a948bdde798aaf0f87c18846281deb05fd
Parents: 8fdeb4b
Author: Josh Rosen 
Authored: Mon Jan 8 16:04:03 2018 +0800
Committer: gatorsmile 
Committed: Mon Jan 8 16:04:03 2018 +0800

--
 .../spark/sql/catalyst/optimizer/Optimizer.scala|  3 ++-
 .../catalyst/optimizer/FilterPushdownSuite.scala| 13 +
 .../test/resources/sql-tests/inputs/group-by.sql|  9 +
 .../resources/sql-tests/results/group-by.sql.out| 16 +++-
 4 files changed, 39 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2c73d2a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 0d4b02c..df0af82 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -795,7 +795,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
         project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
 
     case filter @ Filter(condition, aggregate: Aggregate)
-      if aggregate.aggregateExpressions.forall(_.deterministic) =>
+      if aggregate.aggregateExpressions.forall(_.deterministic)
+        && aggregate.groupingExpressions.nonEmpty =>
       // Find all the aliased expressions in the aggregate list that don't include any actual
       // AggregateExpression, and create a map from the alias to the expression
       val aliasMap = AttributeMap(aggregate.aggregateExpressions.collect {

http://git-wip-us.apache.org/repos/asf/spark/blob/2c73d2a9/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
--
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
index 85a5e97..82a1025 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
+++
