yyanyy commented on code in PR #55546:
URL: https://github.com/apache/spark/pull/55546#discussion_r3139925343


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala:
##########
@@ -199,6 +199,20 @@ case class DataSourceV2ScanRelation(
   }
 
   override def computeStats(): Statistics = {
+    // If the table reports catalog-level (pre-filter, pre-pruning) statistics 
via
+    // SupportsReportCatalogStatistics and those stats carry a row count, 
prefer them:
+    // they describe the full table independent of any pushdown and are 
typically the
+    // strongest stats available to CBO. Otherwise fall through to the scan's 
own
+    // post-pushdown statistics as before.
+    relation.table match {

Review Comment:
   I’m a bit worried that this could unintentionally change behavior for new 
connectors implementing this interface without realizing the implications. 
Today, DSv2 stats are generally expected to reflect pushed-down filters, given 
the existing behavior of e.g. fully pushed-down filters are removed from the 
plan. With this change, any stats updates done in the scan implementation by 
connectors using the new interface would effectively be ignored.
   
   Long term, I think the optimizer estimation rule should distinguish between 
catalog stats and stats that incorporate pushdown, so connectors supporting 
either/both model can be handled correctly and the most accurate stats can be 
used. Without that optimizer improvement, though, this behavior change feels 
non-obvious and potentially confusing. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to