anton5798 opened a new pull request, #55546:
URL: https://github.com/apache/spark/pull/55546

   ### What changes were proposed in this pull request?
   
   This PR introduces `SupportsReportCatalogStatistics`, a new mixin interface 
for `org.apache.spark.sql.connector.catalog.Table`. DSv2 connectors can 
implement it to report table-level (pre-filter, pre-pruning) statistics to 
Spark without going through a `Scan`.
   
   ```java
   @Evolving
   public interface SupportsReportCatalogStatistics extends Table {
     Statistics catalogStatistics();
   }
   ```
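
As a rough illustration of what a connector-side implementation could look like, the sketch below declares simplified stand-ins for `Table` and `Statistics` so it compiles without Spark on the classpath. `MetastoreTable`, its constructor arguments, and the "negative means unknown" convention are hypothetical and not part of this PR; only the shape of `SupportsReportCatalogStatistics` is taken from the snippet above.

```java
import java.util.OptionalLong;

// Hypothetical, self-contained sketch. The nested interfaces are simplified
// stand-ins for the DSv2 types, not the real Spark classes.
final class CatalogStatsMixinSketch {
  interface Statistics {
    OptionalLong sizeInBytes();
    OptionalLong numRows();
  }

  interface Table {
    String name();
  }

  // Same shape as the interface proposed in this PR.
  interface SupportsReportCatalogStatistics extends Table {
    Statistics catalogStatistics();
  }

  // Assumed connector table whose stats arrived together with its catalog
  // metadata, so reporting them needs no ScanBuilder and no file listing.
  static final class MetastoreTable implements SupportsReportCatalogStatistics {
    private final String name;
    private final long bytes;
    private final long rows; // assumption: a negative value means "unknown"

    MetastoreTable(String name, long bytes, long rows) {
      this.name = name;
      this.bytes = bytes;
      this.rows = rows;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public Statistics catalogStatistics() {
      return new Statistics() {
        @Override
        public OptionalLong sizeInBytes() {
          return OptionalLong.of(bytes);
        }

        @Override
        public OptionalLong numRows() {
          // Report a row count only when the catalog actually recorded one.
          return rows >= 0 ? OptionalLong.of(rows) : OptionalLong.empty();
        }
      };
    }
  }

  public static void main(String[] args) {
    MetastoreTable table = new MetastoreTable("db.events", 4096L, 100L);
    Statistics stats = table.catalogStatistics();
    System.out.println(table.name() + " rows=" + stats.numRows()
        + " bytes=" + stats.sizeInBytes());
  }
}
```

Returning an empty `numRows()` when the catalog has no row count matters in this design, because the description says Spark only prefers catalog stats that carry a row count.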
   
   `DataSourceV2ScanRelation.computeStats` is updated to prefer these catalog 
stats when the table implements the mixin *and* the returned stats carry a row 
count; otherwise it falls through to the existing 
`Scan.estimateStatistics()`-based path unchanged. The change is strictly 
additive: tables that don't implement the mixin see identical behavior.
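
The preference rule described above can be sketched as follows. This is not the actual `DataSourceV2ScanRelation.computeStats` code, which lives in Spark's Scala sources; it is a minimal Java model of the stated rule, with `Stats`, `Table`, `Scan`, and `chooseStats` as hypothetical stand-ins.

```java
import java.util.OptionalLong;

// Hypothetical model of the selection rule; all types are local stand-ins.
final class ComputeStatsPreferenceSketch {
  static final class Stats {
    final OptionalLong sizeInBytes;
    final OptionalLong numRows;

    private Stats(OptionalLong sizeInBytes, OptionalLong numRows) {
      this.sizeInBytes = sizeInBytes;
      this.numRows = numRows;
    }

    static Stats of(long sizeInBytes, long numRows) {
      return new Stats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows));
    }

    static Stats sizeOnly(long sizeInBytes) {
      return new Stats(OptionalLong.of(sizeInBytes), OptionalLong.empty());
    }
  }

  interface Table {}

  interface SupportsReportCatalogStatistics extends Table {
    Stats catalogStatistics();
  }

  interface Scan {
    Stats estimateStatistics();
  }

  // Prefer catalog stats only when the table implements the mixin AND the
  // returned stats carry a row count; otherwise use the scan-based path.
  static Stats chooseStats(Table table, Scan scan) {
    if (table instanceof SupportsReportCatalogStatistics) {
      Stats catalog = ((SupportsReportCatalogStatistics) table).catalogStatistics();
      if (catalog.numRows.isPresent()) {
        return catalog;
      }
    }
    return scan.estimateStatistics();
  }

  public static void main(String[] args) {
    Scan scan = () -> Stats.of(10L, 1L);
    SupportsReportCatalogStatistics withRows = () -> Stats.of(4096L, 100L);
    Table plain = new Table() {};
    System.out.println("mixin table rows: " + chooseStats(withRows, scan).numRows);
    System.out.println("plain table rows: " + chooseStats(plain, scan).numRows);
  }
}
```

A table that does not implement the mixin, or that implements it but reports no row count, ends up on the scan path in this model, which is the additive behavior the description claims.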
   
   ### Why are the changes needed?
   
   DSv2 today exposes both catalog-level statistics (pre-filter, table-wide, 
analogous to DSv1's `CatalogTable.stats`) and scan-level statistics 
(post-filter, post-pruning) through the single `Scan.estimateStatistics()` 
entry point. Reading catalog stats therefore requires building a 
`ScanBuilder`, which can trigger unnecessary work (file listing, remote 
metadata fetches) and hides a logical property of the table inside the scan API.
   
   `SupportsReportCatalogStatistics` gives connectors a direct, 
scan-independent way to report table-level stats. It is a strict analog of 
DSv1's `CatalogStatistics` for the v2 catalog API and is a natural input to CBO 
decisions on the unfiltered relation (join reordering, broadcast thresholds), 
while scan statistics continue to tighten estimates once pushdown has happened.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The new interface is additive and no Spark built-in `Table` implements 
it in this PR. Existing connectors and queries observe no behavior change.
   
   ### How was this patch tested?
   
   No new tests. The `computeStats` change is additive and preserves the 
existing Scan-based path verbatim; existing DSv2 suites (`DataSourceV2Suite`, 
`DataSourceV2SQLSuite`) continue to exercise it. Follow-up PRs that add an 
in-tree implementation of the mixin will add targeted coverage for the new path.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

