[GitHub] [spark] AngersZhuuuu commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572535922

## File path: sql/core/src/test/resources/sql-tests/results/explain-cbo.sql.out

@@ -0,0 +1,80 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 5
+
+
+-- !query
+CREATE TABLE t1(a INT, b INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+CREATE TABLE t2(c INT, d INT) USING PARQUET
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+EXPLAIN COST WITH max_store_sales AS
+(
+  SELECT max(csales) tpcds_cmax
+  FROM (
+    SELECT sum(b) csales
+    FROM t1 WHERE a < 100
+  ) x
+),
+best_ss_customer AS
+(
+  SELECT c
+  FROM t2
+  WHERE d > (SELECT * FROM max_store_sales)
+)
+SELECT c FROM best_ss_customer
+-- !query schema
+struct
+-- !query output
+== Optimized Logical Plan ==
+Project [c#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
++- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > scalar-subquery#x [])), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :  +- Aggregate [max(csales#xL) AS tpcds_cmax#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :     +- Aggregate [sum(b#x) AS csales#xL], Statistics(sizeInBytes=16.0 B, rowCount=1)
+   :        +- Project [b#x], Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :           +- Filter (isnotnull(a#x) AND (a#x < 100)), Statistics(sizeInBytes=1.0 B, rowCount=0)
+   :              +- Relation[a#x,b#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+   +- Relation[c#x,d#x] parquet, Statistics(sizeInBytes=1.0 B, rowCount=0)
+
+== Physical Plan ==
+AdaptiveSparkPlan isFinalPlan=false
++- Project [c#x]
+   +- Filter (isnotnull(d#x) AND (cast(d#x as bigint) > Subquery subquery#x, [id=#x]))
+      :  +- Subquery subquery#x, [id=#x]
+      :     +- AdaptiveSparkPlan isFinalPlan=false
+      :        +- HashAggregate(keys=[], functions=[max(csales#xL)], output=[tpcds_cmax#xL])
+      :           +- HashAggregate(keys=[], functions=[partial_max(csales#xL)], output=[max#xL])
+      :              +- HashAggregate(keys=[], functions=[sum(b#x)], output=[csales#xL])
+      :                 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#x]
+      :                    +- HashAggregate(keys=[], functions=[partial_sum(b#x)], output=[sum#xL])
+      :                       +- Project [b#x]
+      :                          +- Filter (isnotnull(a#x) AND (a#x < 100))
+      :                             +- FileScan parquet default.t1[a#x,b#x] Batched: true, DataFilters: [isnotnull(a#x), (a#x < 100)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszh/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [IsNotNull(a), LessThan(a,100)], ReadSchema: struct

Review comment:

> The Physical Plan should be normalized.

Hmm, got the usage; updated.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572518676

## File path: sql/core/src/test/resources/sql-tests/inputs/explain-cbo.sql

@@ -0,0 +1,25 @@
+CREATE TABLE t1(a INT, b INT) USING PARQUET;
+CREATE TABLE t2(c INT, d INT) USING PARQUET;
+
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS;
+
+SET spark.sql.cbo.enabled=true;

Review comment:

> We can add `--SET spark.sql.cbo.enabled=true` at the first line of this file.

Updated.
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572208986

## File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:

> AFAIK the test framework erases the location string. Can you try it out?

Hm, updated.
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572125668

## File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")

Review comment:

> nit: use `withSQLConf`

Done.
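The `withSQLConf` nit above is about config hygiene in tests. A minimal sketch of the save-and-restore pattern behind such a helper (plain Scala with a mutable map standing in for Spark's `SQLConf`; the `withConf` name and shape are illustrative, not Spark's implementation):

```scala
import scala.collection.mutable

// A map stands in for Spark's SQLConf in this sketch.
val conf = mutable.Map.empty[String, String]

// Set the given keys, run the body, and always restore the previous values.
// This is the point of the review nit: a bare sql("set spark.sql.cbo.enabled=true")
// leaks into later tests, while this pattern cleans up even if the body throws.
def withConf[A](pairs: (String, String)*)(body: => A): A = {
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body
  finally saved.foreach {
    case (k, Some(old)) => conf(k) = old   // restore prior value
    case (k, None)      => conf.remove(k)  // key was unset before
  }
}

withConf("spark.sql.cbo.enabled" -> "true") {
  assert(conf("spark.sql.cbo.enabled") == "true")
}
assert(!conf.contains("spark.sql.cbo.enabled")) // restored after the block
```

Spark's real helper additionally resolves deprecated keys and validates config names, but the loan pattern is the same.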
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572123315

## File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:

> can we create `explain-cbo.sql` and move this test to there?

Explain output will carry some local info, such as:

+- FileScan parquet default.t2[c#xL,d#xL] Batched: true, DataFilters: [isnotnull(d#xL)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszh/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [IsNotNull(d)], ReadSchema: struct
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572074133

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -253,6 +254,15 @@ class QueryExecution(
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {

Review comment:

> shall we use `foreach` instead of `transform`?

Yeah, `foreach` seems better here.
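The `foreach`-versus-`transform` suggestion can be illustrated outside Spark. In the sketch below, `Node` is a hypothetical stand-in for Catalyst's `TreeNode`: `transform` reconstructs nodes on the way down even when the partial function changes nothing, while `foreach` only traverses, which is all a trigger-the-stats pass needs.

```scala
case class Node(value: Int, children: Seq[Node] = Nil) {
  // transform rebuilds the tree: it applies the partial function and
  // reconstructs every node via copy, paying allocation cost regardless.
  def transform(f: PartialFunction[Node, Node]): Node = {
    val applied = if (f.isDefinedAt(this)) f(this) else this
    applied.copy(children = applied.children.map(_.transform(f)))
  }
  // foreach visits each node depth-first purely for its side effect.
  def foreach(f: Node => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}

val tree = Node(1, Seq(Node(2), Node(3, Seq(Node(4)))))
var visited = 0
tree.foreach(_ => visited += 1) // pure traversal: no new tree is built
```

Since the PR only needs to *touch* every subquery plan to warm its statistics, returning a rebuilt (but identical) tree from `transform` is wasted work.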
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572073126

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -253,6 +254,15 @@ class QueryExecution(
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {
+        case p => p.transformExpressions {

Review comment:

> ditto, use `p.expressions.foreach(_.foreach ...)` instead.

Updated; how about the current version?
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572051494

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala

@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {

   def default(p: LogicalPlan): T

+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:

> Yea let's move it to `stringWithStats`

Done.
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r57261

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala

@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {

   def default(p: LogicalPlan): T

+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {

Review comment:

> If we don't care about iteration order, `transformExpressions` should be better.

Done.
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571993370

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala

@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {

   def default(p: LogicalPlan): T

+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:

If it looks too weird, we can move this to `stringWithStats`.
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571992802

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala

@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {

   def default(p: LogicalPlan): T

+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:

> This is weird. Doesn't EXPLAIN trigger the plan stats propagation?

Yeah, EXPLAIN triggers plan stats propagation before building the string:

```
private def stringWithStats(maxFields: Int, append: String => Unit): Unit = {
  val maxFields = SQLConf.get.maxToStringFields

  // trigger to compute stats for logical plans
  try {
    optimizedPlan.stats
  } catch {
    case e: AnalysisException => append(e.toString + "\n")
  }

  // only show optimized logical plan and physical plan
  append("== Optimized Logical Plan ==\n")
  QueryPlan.append(optimizedPlan, append, verbose = true, addSuffix = true, maxFields)
  append("\n== Physical Plan ==\n")
  QueryPlan.append(executedPlan, append, verbose = true, addSuffix = false, maxFields)
  append("\n")
}
```

I have tested that inside a subquery all statistics behavior is correct, since whenever statistics are needed, `plan.stats` is called. Doing it here just triggers the computation earlier.
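The reply above relies on `optimizedPlan.stats` being safe and cheap to call early. A hedged sketch of the underlying idea (plain Scala, not Catalyst's actual `LogicalPlanStats` trait): statistics are a lazily computed, cached value that propagates bottom-up through the plan, so triggering them in `stringWithStats` only moves the computation earlier, and later consumers reuse the cached result.

```scala
// Hypothetical stand-in for a logical plan node with memoized statistics.
class Plan(rows: Long, val children: Seq[Plan] = Nil) {
  var computations = 0 // instrumentation for this sketch only

  // Computed once on first access, then cached, like a memoized stats field.
  lazy val stats: Long = {
    computations += 1
    rows + children.map(_.stats).sum // stats propagate bottom-up
  }
}

val scan = new Plan(1000)
val filter = new Plan(0, Seq(scan))
filter.stats // triggered early, as EXPLAIN COST does before printing
filter.stats // reused later when the plan string is rendered
```

Because the value is memoized per node, the second access (and any access from a parent plan or a subquery expression) does no extra work.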
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571780104

## File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = ArrayBuffer.empty[SubqueryExpression]

Review comment:

> nit: `mutable.ArrayBuffer.empty[SubqueryExpression]`
>
> (since we already imported the `mutable` package)

Updated.
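The assertion pattern in the test above — traverse the optimized plan's expressions and accumulate every subquery into a buffer, then assert on what was found — can be sketched with hypothetical stand-ins for Catalyst's expression classes (`Attr`, `Subquery`, and `Gt` below are illustrative, not Spark's types):

```scala
import scala.collection.mutable

sealed trait Expr {
  def children: Seq[Expr]
  // Depth-first visit of this expression and all nested children.
  def foreach(f: Expr => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Subquery(body: Expr) extends Expr { def children: Seq[Expr] = Seq(body) }
case class Gt(left: Expr, right: Expr) extends Expr { def children: Seq[Expr] = Seq(left, right) }

// Models the test's condition: d > (SELECT * FROM max_store_sales).
val condition = Gt(Attr("d"), Subquery(Attr("tpcds_cmax")))
val found = mutable.ArrayBuffer.empty[Subquery]
condition.foreach {
  case s: Subquery => found += s
  case _           => ()
}
```

After the walk, the test can assert on `found` (e.g. that exactly one subquery exists and that its plan's statistics were populated), which is what the SPARK-34137 test does with `SubqueryExpression`.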
AngersZh commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571540424

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala

@@ -47,6 +49,15 @@ trait LogicalPlanVisitor[T] {

   def default(p: LogicalPlan): T

+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        subqueryExpression.plan.stats

Review comment:

> Can you also add a comment that this is to trigger stats propagation?

Done.

## File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

@@ -678,4 +680,50 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       }
     }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      val optimizedPlan = df.queryExecution.optimizedPlan
+      optimizedPlan.stats
+      val subqueryExpression = new ArrayBuffer[SubqueryExpression]()

Review comment:

> nit: `ArrayBuffer.empty[SubqueryExpression]`

Done.