[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-12-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23152


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-30 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237808846
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

I checked and it failed without this patch...


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-30 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237808676
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

Actually, when I remove the code change from the source and rerun this test, an exception is thrown.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-30 Thread juliuszsompolski
Github user juliuszsompolski commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237789881
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

Normal query execution doesn't trigger it here, because it doesn't need the stats, so they never get lazily evaluated.
Putting a join over it would probably trigger it without having to force it with `.queryExecution.stringWithStats`.
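A minimal sketch of such a join-based variant (the `other` table, its contents, and the final assertion are illustrative assumptions, not part of this PR):

```scala
// Hypothetical reshaping of the test: with CBO enabled, planning the join asks
// for the filter's statistics, which exercises ColumnStatsMap.hasMinMaxStats
// lazily without calling queryExecution.stringWithStats explicitly.
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
  withTable("all_null", "other") {
    sql("create table all_null (attrInt int)")
    sql("insert into all_null values (null)")
    sql("analyze table all_null compute statistics for columns attrInt")
    sql("create table other (attrInt int)")
    sql("insert into other values (1)")
    checkAnswer(
      sql("select * from all_null a join other b on a.attrInt = b.attrInt where a.attrInt < 1"),
      Nil)
  }
}
```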


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237776897
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

This will be triggered by `sql("select * from all_null where attrInt < 1").queryExecution.stringWithStats`.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237768463
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

This test can pass without this patch.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237735460
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

sure, thank you all


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237721273
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

@liancheng Thanks. @adrian-wang Can you also add a test based on that?


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237720737
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

So this is still an actual bug.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237719791
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
  def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

@viirya This issue is hard to trigger because the count, distinct count, and min/max statistics usually appear together, so `hasMinMaxStats` returns the right result anyway.

A numeric column with all null values is sufficient to reproduce it, since in that case we have count and distinct count statistics but no min/max statistics.
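A self-contained sketch of the failure mode being described (`ColStat` below is a simplified stand-in for Spark's `ColumnStat`, not the real API):

```scala
// Simplified stand-in: an all-null int column gets count/distinct-count stats
// from ANALYZE, but no min/max, since there are no non-null values.
case class ColStat(distinctCount: Option[BigInt], min: Option[Any], max: Option[Any]) {
  def hasCountStats: Boolean = distinctCount.isDefined
  def hasMinMaxStats: Boolean = min.isDefined && max.isDefined
}

val allNullColumn: Option[ColStat] = Some(ColStat(Some(BigInt(0)), None, None))

// Old (copy-pasted) check asks about count stats, so it wrongly reports true:
val buggy = allNullColumn.map(_.hasCountStats).getOrElse(false)  // true
// Fixed check asks about the min/max stats it is named for:
val fixed = allNullColumn.exists(_.hasMinMaxStats)               // false
```

With the buggy check, the estimation code goes on to read `min`/`max` values that are not present.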


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237717671
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

This is a copy-and-paste bug. You can reproduce it with:
```
spark.sql("create table Foo1(a int)")
spark.sql("insert into Foo1 values (null)")
spark.sql("analyze table Foo1 compute statistics for columns a")
spark.sql("select * from Foo1 where a < 1").queryExecution.stringWithStats
```


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237382531
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

Based on https://github.com/apache/spark/pull/23152#discussion_r237359357, the old `hasMinMaxStats` actually works correctly in practice, even though it doesn't look correct at a glance.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237381966
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends 
StatsEstimationTestBase {
   expectedRowCount = 3)
   }
 
+  test("ColumnStatsMap tests") {
--- End diff --

Oh I see. So `FilterEstimation` works well without this change. 


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237380128
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   expectedRowCount = 3)
   }
 
+  test("ColumnStatsMap tests") {
--- End diff --

There's actually no public call to the `evaluateBinary` method.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237359890
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   expectedRowCount = 3)
   }
 
+  test("ColumnStatsMap tests") {
--- End diff --

If `hasMinMaxStats` was not correct previously, would `evaluateBinary` work correctly?


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237359357
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   expectedRowCount = 3)
   }
 
+  test("ColumnStatsMap tests") {
--- End diff --

Hi @viirya, thanks for your review. Here `FilterEstimation` works because we only call `FilterEstimation.estimate` to check it, which first checks whether `rowCount.isEmpty` and shortcuts the estimation. If `rowCount` is not empty, the stats should always have a `min` and `max`, unless they are malformed.
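Roughly, the shortcut being described (a simplified sketch under that assumption, not the actual `FilterEstimation` code):

```scala
// Simplified sketch: when the child plan carries no row count at all,
// estimation returns early, so the min/max-based selectivity code (and hence
// hasMinMaxStats) is never reached through this entry point.
def estimate(rowCount: Option[BigInt]): Option[BigInt] = {
  if (rowCount.isEmpty) return None  // shortcut: no stats, nothing to estimate
  // ... selectivity math that relies on min/max stats would run here ...
  rowCount
}
```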


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237347496
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   expectedRowCount = 3)
   }
 
+  test("ColumnStatsMap tests") {
--- End diff --

Since this was not correct, I think `FilterEstimation` didn't work properly before. Can you also add a related test using `validateEstimatedStats`, similar to the other tests? In that test, we should rely on the min and max stats to do filter estimation.


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237111541
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

Can you add a test for it?


---




[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-26 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/23152

[SPARK-26181][SQL] the `hasMinMaxStats` method of `ColumnStatsMap` is not 
correct

## What changes were proposed in this pull request?

Currently, `hasMinMaxStats` returns the same result as `hasCountStats`, which is obviously not what is expected.
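For reference, the one-line fix (quoted from the diff under review):

```scala
// Before: copy-pasted from hasCountStats, so it checks the wrong stats.
def hasMinMaxStats(a: Attribute): Boolean =
  get(a).map(_.hasCountStats).getOrElse(false)

// After: checks the min/max stats it is named for.
def hasMinMaxStats(a: Attribute): Boolean =
  get(a).exists(_.hasMinMaxStats)
```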

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark minmaxstats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23152


commit 1c1edccc43ac9cdb6da53ad93ac998b06bf8b6dd
Author: Daoyuan Wang 
Date:   2018-11-27T06:18:34Z

fix ColumnStatsMap.hasMinMaxStats




---
