Prashant Singh created SPARK-39678:
--------------------------------------

             Summary: Improve stats estimation for v2 tables
                 Key: SPARK-39678
                 URL: https://issues.apache.org/jira/browse/SPARK-39678
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 3.3.0
            Reporter: Prashant Singh

In the case of v2 tables, connectors can bubble up both [sizeInBytes and rowCount|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java]. Presently, SizeInBytesOnlyStatsPlanVisitor omits propagating or estimating rowCount stats in some places, for example:

* [CodePointer1|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58]
* [CodePointer2|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L46-L47]

For the [non-CBO|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/LogicalPlanStats.scala#L34-L39] flow, as per my understanding, propagating rowCount can improve the stats estimation, since rowCount is also used indirectly in places to estimate the size.
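
To illustrate the idea, here is a minimal, self-contained Scala sketch. It does not use Spark's real classes: `Stats`, `StatsSketch`, `sizePerRow`, `sizeOnlyUnary`, `rowCountAwareUnary`, and the 8-byte per-row overhead are all hypothetical stand-ins chosen for this example. It contrasts a size-only estimate, which scales the child's sizeInBytes by the ratio of row widths and drops rowCount, with a variant that keeps rowCount and derives the size from it. The second variant is only meaningful for a row-preserving operator such as a projection.

```scala
// Simplified stand-in for a statistics record; NOT Spark's Statistics class.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object StatsSketch {
  // Assumed fixed per-row overhead in bytes (an assumption of this sketch).
  val RowOverhead = 8

  // Estimated width of one row: overhead plus the sum of the column widths.
  def sizePerRow(columnWidths: Seq[Int]): BigInt =
    BigInt(RowOverhead + columnWidths.sum)

  // Size-only estimate: scale the child's size by the row-width ratio.
  // rowCount is not propagated, which is the behaviour the issue describes.
  def sizeOnlyUnary(child: Stats, childCols: Seq[Int], outCols: Seq[Int]): Stats = {
    val scaled = child.sizeInBytes * sizePerRow(outCols) / sizePerRow(childCols)
    Stats(sizeInBytes = scaled.max(BigInt(1)))
  }

  // Row-count-aware estimate for a row-preserving operator: keep rowCount and
  // derive the size from it; fall back to the size-only path if it is unknown.
  def rowCountAwareUnary(child: Stats, childCols: Seq[Int], outCols: Seq[Int]): Stats =
    child.rowCount match {
      case Some(rows) =>
        val size = if (rows > 0) rows * sizePerRow(outCols) else BigInt(1)
        Stats(sizeInBytes = size, rowCount = Some(rows))
      case None =>
        sizeOnlyUnary(child, childCols, outCols)
    }
}
```

With a child reporting 1000 bytes and 10 rows, and a projection from columns of widths (4, 8, 20) down to (4, 8), the size-only path yields 500 bytes with no rowCount, while the row-count-aware path yields 200 bytes and keeps rowCount = 10 for operators further up the plan. Which estimate is closer to reality depends on how accurate the connector's reported numbers are.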