[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572226095



##
File path: sql/core/src/test/resources/sql-tests/inputs/explain-cbo.sql
##
@@ -0,0 +1,25 @@
+CREATE TABLE t1(a INT, b INT) USING PARQUET;
+CREATE TABLE t2(c INT, d INT) USING PARQUET;
+
+ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
+ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS;
+
+SET spark.sql.cbo.enabled=true;

Review comment:
   We can add `--SET spark.sql.cbo.enabled=true` as the first line of this file.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572163060



##
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
   AFAIK the test framework erases the location string. Can you try it out?








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572094831



##
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")
+
+      val df = sql(
+        """
+          |WITH max_store_sales AS
+          |(
+          |  SELECT max(csales) tpcds_cmax
+          |  FROM (
+          |    SELECT sum(b) csales
+          |    FROM t1 WHERE a < 100
+          |  ) x
+          |),
+          |best_ss_customer AS
+          |(
+          |  SELECT c
+          |  FROM t2
+          |  WHERE d > (SELECT * FROM max_store_sales)
+          |)
+          |SELECT c FROM best_ss_customer
+          |""".stripMargin)
+      df.queryExecution.stringWithStats

Review comment:
   Can we create `explain-cbo.sql` and move this test there?








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572092844



##
File path: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
##
@@ -678,4 +679,49 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("SPARK-34137: Update suquery's stats when build LogicalPlan's stats") {
+    withTable("t1", "t2") {
+      sql("create table t1 using parquet as select id as a, id as b from range(1000)")
+      sql("create table t2 using parquet as select id as c, id as d from range(2000)")
+
+      sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
+      sql("set spark.sql.cbo.enabled=true")

Review comment:
   nit: use `withSQLConf`
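
   The reason for the nit: a bare `set` statement leaves the flag changed for
   whatever runs later in the same session, while a scoped helper restores the
   previous value when the block exits. A minimal, self-contained sketch of that
   loan pattern follows. `ConfDemo`, `conf`, and `withConf` are hypothetical
   stand-ins written for this illustration; they are not Spark's `SQLConf` or
   its `withSQLConf` helper.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a session configuration; not Spark's SQLConf.
object ConfDemo {
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Loan pattern: apply the overrides, run the body, then restore the
  // previous values even if the body throws.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    // Remember what each key held before the override (None = unset).
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None)      => conf.remove(k)
    }
  }

  def main(args: Array[String]): Unit = {
    val inside = withConf("spark.sql.cbo.enabled" -> "true") {
      conf.get("spark.sql.cbo.enabled")
    }
    println(s"inside the block: $inside")
    println(s"after the block:  ${conf.get("spark.sql.cbo.enabled")}")
  }
}
```

   In the test under review, the equivalent would presumably be to wrap the
   body in `withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { ... }` and drop
   the `sql("set spark.sql.cbo.enabled=true")` line.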








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572053175



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {
+        case p => p.transformExpressions {

Review comment:
   ditto, use `p.expressions.foreach(_.foreach ...)` instead.
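
   The suggestion here is to walk the tree purely for side effects rather than
   rebuild it with a transform whose result is thrown away. A self-contained
   sketch of the nested `foreach` shape is below. `Expr`, `Plan`,
   `ScalarSubquery`, and `TraversalDemo` are hypothetical miniatures defined
   only for this illustration; their names echo Catalyst's, but this is not
   Catalyst code.

```scala
// Hypothetical miniature of a plan/expression tree; not Catalyst's classes.
sealed trait Expr {
  def children: Seq[Expr]
  // Visit this node and every descendant, for side effects only.
  def foreach(f: Expr => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}
case class Literal(value: Int) extends Expr { val children: Seq[Expr] = Nil }
case class GreaterThan(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}
case class ScalarSubquery(plan: Plan) extends Expr { val children: Seq[Expr] = Nil }

case class Plan(name: String, expressions: Seq[Expr], children: Seq[Plan]) {
  // Visit this plan node and every child plan, for side effects only.
  def foreach(f: Plan => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}

object TraversalDemo {
  // Outer foreach walks plan nodes; inner foreach walks each expression tree.
  // Recording the subquery plan's name stands in for "trigger its stats".
  def subqueryPlans(root: Plan): Seq[String] = {
    val seen = scala.collection.mutable.ArrayBuffer.empty[String]
    root.foreach { p =>
      p.expressions.foreach(_.foreach {
        case ScalarSubquery(sub) => seen += sub.name
        case _                   => ()
      })
    }
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    val sub = Plan("Aggregate", Nil, Nil)
    val filter = Plan(
      "Filter",
      Seq(GreaterThan(Literal(1), ScalarSubquery(sub))),
      Seq(Plan("Relation", Nil, Nil)))
    println(subqueryPlans(Plan("Project", Nil, Seq(filter))))
  }
}
```

   Nothing is copied or rebuilt during the walk, which is the point of
   preferring `foreach` when the traversal exists only to trigger a side effect.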








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572052721



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -253,6 +254,15 @@ class QueryExecution(
 
     // trigger to compute stats for logical plans
     try {
+      optimizedPlan.transform {

Review comment:
   Shall we use `foreach` instead of `transform`?








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r572014419



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
   Yeah, let's move it to `stringWithStats`.








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571988475



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {
+      case subqueryExpression: SubqueryExpression =>
+        // trigger subquery's child plan stats propagation

Review comment:
   This is weird. Doesn't EXPLAIN trigger the plan stats propagation?








[GitHub] [spark] cloud-fan commented on a change in pull request #31485: [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #31485:
URL: https://github.com/apache/spark/pull/31485#discussion_r571987869



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
##
@@ -47,6 +49,16 @@ trait LogicalPlanVisitor[T] {
 
   def default(p: LogicalPlan): T
 
+  def visitSubqueryExpression(p: LogicalPlan): LogicalPlan = {
+    p.transformExpressionsDown {

Review comment:
   If we don't care about iteration order, `transformExpressions` should be better.
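
   For readers unsure what "iteration order" refers to: a "down" transform
   applies the rule to a node before its children (pre-order), and an "up"
   transform applies it after (post-order). The toy below makes the difference
   visible. `Node`, `transformDown`, `transformUp`, and `OrderDemo` are
   hypothetical definitions for this sketch only; the method names merely echo
   Catalyst's.

```scala
// Hypothetical toy tree; only the method names echo Catalyst's TreeNode.
case class Node(label: String, children: Seq[Node] = Nil) {
  // Pre-order: apply the rule to this node first, then to its children.
  def transformDown(rule: PartialFunction[Node, Node]): Node = {
    val after = rule.applyOrElse(this, identity[Node])
    after.copy(children = after.children.map(_.transformDown(rule)))
  }

  // Post-order: rewrite the children first, then apply the rule to the result.
  def transformUp(rule: PartialFunction[Node, Node]): Node = {
    val rebuilt = copy(children = children.map(_.transformUp(rule)))
    rule.applyOrElse(rebuilt, identity[Node])
  }
}

object OrderDemo {
  val tree: Node = Node("a", Seq(Node("b"), Node("c")))

  // Run a traversal with a rule that only records the labels it sees.
  def visitOrder(
      traverse: (Node, PartialFunction[Node, Node]) => Node,
      root: Node): List[String] = {
    val seen = scala.collection.mutable.ArrayBuffer.empty[String]
    traverse(root, { case n => seen += n.label; n })
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    println(visitOrder(_.transformDown(_), tree)) // List(a, b, c)
    println(visitOrder(_.transformUp(_), tree))   // List(b, c, a)
  }
}
```

   When the rule only needs to see each subquery expression once, either order
   gives the same result, so the unqualified name states the intent more
   plainly.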




