yyanyy commented on code in PR #55546:
URL: https://github.com/apache/spark/pull/55546#discussion_r3139925343
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala:
##########
@@ -199,6 +199,20 @@ case class DataSourceV2ScanRelation(
}
override def computeStats(): Statistics = {
+ // If the table reports catalog-level (pre-filter, pre-pruning) statistics
via
+ // SupportsReportCatalogStatistics and those stats carry a row count,
prefer them:
+ // they describe the full table independent of any pushdown and are
typically the
+ // strongest stats available to CBO. Otherwise fall through to the scan's
own
+ // post-pushdown statistics as before.
+ relation.table match {
Review Comment:
I’m a bit worried that this could unintentionally change behavior for new
connectors implementing this interface without realizing the implications.
Today, DSv2 stats are generally expected to reflect pushed-down filters, given
the existing behavior of e.g. fully pushed-down filters are removed from the
plan. With this change, any stats updates done in the scan implementation by
connectors using the new interface would effectively be ignored.
Long term, I think the optimizer estimation rule should distinguish between
catalog stats and stats that incorporate pushdown, so connectors supporting
either/both model can be handled correctly and the most accurate stats can be
used. Without that optimizer improvement, though, this behavior change feels
non-obvious and potentially confusing. What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]