Prashant Singh created SPARK-39678:
--------------------------------------

             Summary: Improve stats estimation for v2 tables
                 Key: SPARK-39678
                 URL: https://issues.apache.org/jira/browse/SPARK-39678
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 3.3.0
            Reporter: Prashant Singh


In the case of v2 tables, connectors can bubble up both 
[sizeInBytes and rowCount|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java].

Presently, SizeInBytesOnlyStatsPlanVisitor omits propagating / estimating 
rowCount stats in some places, such as:
 * [CodePointer1|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58]
 * [CodePointer2|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L46-L47]

For the 
[non-cbo|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/LogicalPlanStats.scala#L34-L39]
flow, as per my understanding, propagating rowCount can improve the stats 
estimation, since rowCount is indirectly used in places to estimate the size 
as well.
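
To make the idea concrete, here is a minimal, self-contained sketch of the 
difference between dropping and keeping rowCount at a unary node. It is a toy 
model and not Spark's code: the Stats class, the sizeOnly and withRowCount 
helpers, and the per-row sizes are all hypothetical names and numbers chosen 
for illustration. It assumes the size can be derived as rowCount times the 
output row width when a row count is available.

```java
import java.util.OptionalLong;

public class RowCountPropagationSketch {

    // Hypothetical stand-in for a plan node's statistics (not Spark's Statistics class).
    static final class Stats {
        final long sizeInBytes;
        final OptionalLong rowCount;

        Stats(long sizeInBytes, OptionalLong rowCount) {
            this.sizeInBytes = sizeInBytes;
            this.rowCount = rowCount;
        }

        @Override
        public String toString() {
            return "Stats(sizeInBytes=" + sizeInBytes + ", rowCount=" + rowCount + ")";
        }
    }

    // Size-only estimate for a unary node: scale the child's size by the ratio
    // of output row width to child row width, and drop the row count.
    static Stats sizeOnly(Stats child, long childRowSize, long outputRowSize) {
        long size = Math.max(1L, child.sizeInBytes * outputRowSize / childRowSize);
        return new Stats(size, OptionalLong.empty());
    }

    // Assumed alternative: if the child reports a row count, keep it and derive
    // the size from rowCount * outputRowSize; otherwise fall back to sizeOnly.
    static Stats withRowCount(Stats child, long childRowSize, long outputRowSize) {
        if (child.rowCount.isPresent()) {
            long rows = child.rowCount.getAsLong();
            return new Stats(Math.max(1L, rows * outputRowSize), child.rowCount);
        }
        return sizeOnly(child, childRowSize, outputRowSize);
    }

    public static void main(String[] args) {
        // Hypothetical v2 scan reporting 1000 rows and 8000 bytes.
        Stats scan = new Stats(8_000L, OptionalLong.of(1_000L));

        // Unary node that narrows rows from 40 bytes to 20 bytes each.
        System.out.println(sizeOnly(scan, 40L, 20L));
        System.out.println(withRowCount(scan, 40L, 20L));
    }
}
```

In this toy setup the two estimates diverge whenever the size reported by the 
connector does not equal rowCount times the in-memory row width, and only the 
second variant leaves a row count available to nodes further up the plan.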

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
