Yuming Wang created SPARK-34137:
-----------------------------------

             Summary: The tree string does not contain statistics for nested scalar sub queries
                 Key: SPARK-34137
                 URL: https://issues.apache.org/jira/browse/SPARK-34137
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yuming Wang

How to reproduce:

{code:scala}
spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")

spark.sql("set spark.sql.cbo.enabled=true")

spark.sql(
  """
    |WITH max_store_sales AS
    |  (SELECT max(csales) tpcds_cmax
    |  FROM (SELECT
    |    sum(b) csales
    |  FROM t1 WHERE a < 100 ) x),
    |best_ss_customer AS
    |  (SELECT
    |    c
    |  FROM t2
    |  WHERE d > (SELECT * FROM max_store_sales))
    |
    |SELECT c FROM best_ss_customer
    |""".stripMargin).explain("cost")
{code}

Output (the Aggregate, Project and Filter nodes inside the scalar subquery are printed without Statistics):

{noformat}
== Optimized Logical Plan ==
Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
+- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
   :     +- Aggregate [sum(b#4266L) AS csales#4260L]
   :        +- Project [b#4266L]
   :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
   :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
   +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
{noformat}

Another case is TPC-DS q23a.
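
Until the tree string covers these nodes, one possible way to inspect the estimates is to walk the optimized plan by hand and descend into the subquery plans. The following is only a sketch, not the fix for this issue: {{SubqueryStats}} and {{printStats}} are hypothetical names introduced here, and the code assumes a spark-shell session (Spark 3.x on the classpath) with the tables from the reproduction above.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object SubqueryStats {
  // Hypothetical helper (not part of Spark): prints each node of `plan`
  // together with its estimated statistics, then descends into any subquery
  // plans referenced by that node's expressions.
  // `indent` grows only with subquery nesting; it does not mirror tree depth.
  def printStats(plan: LogicalPlan, indent: Int = 0): Unit = {
    plan.foreach { node =>
      println(" " * indent + s"${node.nodeName}: Statistics(${node.stats.simpleString})")
      // `subqueries` covers only the expressions of this node, so it is
      // called per node while traversing the whole plan.
      node.subqueries.foreach(sub => printStats(sub, indent + 2))
    }
  }
}

// Usage in spark-shell, assuming t1 and t2 exist and have been analyzed:
// val df = spark.sql("SELECT c FROM t2 WHERE d > (SELECT max(b) FROM t1 WHERE a < 100)")
// SubqueryStats.printStats(df.queryExecution.optimizedPlan)
```

The values printed this way come from {{LogicalPlan.stats}}, so they follow the session's current {{spark.sql.cbo.enabled}} setting.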