[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23152

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237808846

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

I checked and it failed without this patch...
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237808676

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

Actually, when I remove the code change from the source and rerun this test, an exception is thrown.
Github user juliuszsompolski commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237789881

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

Normal query execution doesn't trigger it here, because it doesn't need stats, so they never get lazily evaluated. Putting a join over it would probably trigger it without having to force it with `.queryExecution.stringWithStats`.
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237776897

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

This will be triggered by `sql("select * from all_null where attrInt < 1").queryExecution.stringWithStats`.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237768463

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2276,4 +2276,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
+
+  test("SPARK-26181 hasMinMaxStats method of ColumnStatsMap is not correct") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+      withTable("all_null") {
+        sql("create table all_null (attrInt int)")
+        sql("insert into all_null values (null)")
+        sql("analyze table all_null compute statistics for columns attrInt")
+        checkAnswer(sql("select * from all_null where attrInt < 1"), Nil)
--- End diff --

This test can pass without this patch.
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237735460

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

Sure, thank you all.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237721273

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

@liancheng Thanks. @adrian-wang Can you also add a test based on that?
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237720737

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

So this is still an actual bug.
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237719791

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

@viirya This issue is hard to trigger because usually the count, distinct count, and min/max statistics appear together, and therefore `hasMinMaxStats` always returns the right result. A numeric column with all null values is sufficient to reproduce this, since in that case we have the count and distinct count statistics but not the min/max statistics.
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237717671

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

This is a copy-and-paste bug. You can reproduce it by:

```
spark.sql("create table Foo1(a int)")
spark.sql("insert into Foo1 values (null)")
spark.sql("analyze table Foo1 compute statistics for columns a")
spark.sql("select * from Foo1 where a < 1").queryExecution.stringWithStats
```
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237382531

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

Based on https://github.com/apache/spark/pull/23152#discussion_r237359357, the old `hasMinMaxStats` actually works correctly, although it doesn't look like it at a glance.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237381966

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("ColumnStatsMap tests") {
--- End diff --

Oh I see. So `FilterEstimation` works well without this change.
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237380128

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("ColumnStatsMap tests") {
--- End diff --

There's no public call to the `evaluateBinary` method, actually.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237359890

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("ColumnStatsMap tests") {
--- End diff --

If `hasMinMaxStats` was not correct previously, will `evaluateBinary` work correctly?
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237359357

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("ColumnStatsMap tests") {
--- End diff --

Hi @viirya, thanks for your review. Here `FilterEstimation` works because we only call `FilterEstimation.estimate` to check it, which checks `rowCount.isEmpty` first and short-circuits the estimation. If `rowCount` is not empty, we should always have `min` and `max` in the stats, unless they are malformed.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237347496

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -821,6 +822,32 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("ColumnStatsMap tests") {
--- End diff --

Since this is not correct, I think that `FilterEstimation` didn't work previously. Can you also add a related test using `validateEstimatedStats`, similar to other tests? In the test, we should rely on the min and max stats to do filter estimation.
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237111541

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }

   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)

   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)

   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

Can you add a test for it?
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/23152

[SPARK-26181][SQL] the `hasMinMaxStats` method of `ColumnStatsMap` is not correct

## What changes were proposed in this pull request?

Currently `hasMinMaxStats` returns the same result as `hasCountStats`, which is obviously not as expected.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-wang/spark minmaxstats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23152.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23152

commit 1c1edccc43ac9cdb6da53ad93ac998b06bf8b6dd
Author: Daoyuan Wang
Date:   2018-11-27T06:18:34Z

    fix ColumnStatsMap.hasMinMaxStats
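Beyond the one-word fix, the patch also rewrites `map(...).getOrElse(false)` as `exists(...)`. For a Boolean predicate on an `Option`, the two forms are interchangeable, which this small Spark-independent Scala sketch illustrates; the actual bug was the copy-pasted predicate (`hasCountStats` where `hasMinMaxStats` was meant), not the combinator.

```scala
// Option.exists(p) is equivalent to Option.map(p).getOrElse(false):
// it is false for None and p(v) for Some(v).
val present: Option[Int] = Some(3)
val absent: Option[Int] = None

assert(present.map(_ > 1).getOrElse(false) == present.exists(_ > 1)) // both true
assert(absent.map(_ > 1).getOrElse(false) == absent.exists(_ > 1))   // both false
```

Using `exists` is simply the more idiomatic spelling, and it makes the predicate being checked easier to see at a glance.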