anton5798 opened a new pull request, #55546:
URL: https://github.com/apache/spark/pull/55546
### What changes were proposed in this pull request?
This PR introduces `SupportsReportCatalogStatistics`, a new mix-in interface
on `org.apache.spark.sql.connector.catalog.Table`. DSv2 connectors can
implement it to report table-level (pre-filter, pre-pruning) statistics to
Spark without going through a `Scan`.
```java
@Evolving
public interface SupportsReportCatalogStatistics extends Table {
  Statistics catalogStatistics();
}
```
`DataSourceV2ScanRelation.computeStats` is updated to prefer these catalog
stats when the table implements the mixin *and* the returned stats carry a row
count; otherwise it falls through to the existing
`Scan.estimateStatistics()`-based path unchanged. The change is strictly
additive: tables that don't implement the mixin see identical behavior.
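
The selection rule described above can be sketched roughly as follows. This is an illustration only, not the actual `computeStats` code from the PR: the nested `Statistics`, `Table`, and `SupportsReportCatalogStatistics` types are simplified, hypothetical stand-ins for the Spark interfaces so that the example compiles on its own, and `chooseStats` and `SimpleStats` are names invented for this sketch.

```java
import java.util.OptionalLong;

public class CatalogStatsPreferenceSketch {

    // Hypothetical stand-ins for Spark's connector types (not the real classes).
    interface Statistics {
        OptionalLong sizeInBytes();
        OptionalLong numRows();
    }

    interface Table {}

    interface SupportsReportCatalogStatistics extends Table {
        Statistics catalogStatistics();
    }

    // Simple immutable Statistics value used only for this illustration.
    static final class SimpleStats implements Statistics {
        private final OptionalLong sizeInBytes;
        private final OptionalLong numRows;

        SimpleStats(OptionalLong sizeInBytes, OptionalLong numRows) {
            this.sizeInBytes = sizeInBytes;
            this.numRows = numRows;
        }

        @Override public OptionalLong sizeInBytes() { return sizeInBytes; }
        @Override public OptionalLong numRows() { return numRows; }
    }

    // Prefer catalog stats only when the table has the mix-in AND the stats
    // carry a row count; otherwise keep the scan-based estimate unchanged.
    static Statistics chooseStats(Table table, Statistics scanStats) {
        if (table instanceof SupportsReportCatalogStatistics) {
            Statistics catalogStats =
                ((SupportsReportCatalogStatistics) table).catalogStatistics();
            if (catalogStats.numRows().isPresent()) {
                return catalogStats;
            }
        }
        return scanStats;
    }

    public static void main(String[] args) {
        Statistics scanStats =
            new SimpleStats(OptionalLong.of(4096L), OptionalLong.of(10L));

        Table plain = new Table() {};
        SupportsReportCatalogStatistics withRows =
            () -> new SimpleStats(OptionalLong.of(1_000_000L), OptionalLong.of(5_000L));
        SupportsReportCatalogStatistics sizeOnly =
            () -> new SimpleStats(OptionalLong.of(1_000_000L), OptionalLong.empty());

        // No mix-in: scan stats are used.
        System.out.println(chooseStats(plain, scanStats).numRows().getAsLong());    // 10
        // Mix-in with a row count: catalog stats win.
        System.out.println(chooseStats(withRows, scanStats).numRows().getAsLong()); // 5000
        // Mix-in without a row count: fall back to scan stats.
        System.out.println(chooseStats(sizeOnly, scanStats).numRows().getAsLong()); // 10
    }
}
```

The third case is the one worth noting: a table that implements the mix-in but reports only a size still takes the scan-based path, which matches the "carry a row count" condition in the description.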
### Why are the changes needed?
DSv2 today conflates catalog-level (pre-filter, table-wide — analogous to
DSv1's `CatalogTable.stats`) and scan-level (post-filter, post-pruning)
statistics on `Scan.estimateStatistics()`. Reading catalog stats requires
building a `ScanBuilder`, which can trigger unnecessary work (file listing,
remote metadata fetches) and hides a logical property of the table inside the
scan API.
`SupportsReportCatalogStatistics` gives connectors a direct,
scan-independent way to report table-level stats. It is the v2 catalog API's
counterpart to DSv1's `CatalogStatistics` and is a natural input to CBO
decisions on the unfiltered relation (join reordering, broadcast thresholds),
while scan statistics continue to tighten estimates once pushdown has happened.
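
As a rough illustration of what a connector-side implementation might look like, the sketch below reports stats straight from catalog metadata the connector already holds, without planning a scan. Everything here is hypothetical: the nested interfaces are simplified stand-ins for the Spark types, and `CatalogEntry`, `CatalogBackedTable`, and `planScan` are invented names. `planScan` merely simulates the file listing that building a scan can trigger.

```java
import java.util.OptionalLong;

public class CatalogBackedTableSketch {

    // Hypothetical stand-ins for Spark's connector types (not the real classes).
    interface Statistics {
        OptionalLong sizeInBytes();
        OptionalLong numRows();
    }

    interface Table {
        String name();
    }

    interface SupportsReportCatalogStatistics extends Table {
        Statistics catalogStatistics();
    }

    // Hypothetical metadata record the connector already has from its catalog.
    static final class CatalogEntry {
        final String name;
        final long totalBytes;
        final long rowCount;

        CatalogEntry(String name, long totalBytes, long rowCount) {
            this.name = name;
            this.totalBytes = totalBytes;
            this.rowCount = rowCount;
        }
    }

    static final class CatalogBackedTable implements SupportsReportCatalogStatistics {
        private final CatalogEntry entry;
        private int fileListings = 0;

        CatalogBackedTable(CatalogEntry entry) {
            this.entry = entry;
        }

        // Simulates the expensive work (e.g. file listing) a scan may need.
        void planScan() {
            fileListings++;
        }

        int fileListings() {
            return fileListings;
        }

        @Override
        public String name() {
            return entry.name;
        }

        // Table-wide, pre-filter stats served from catalog metadata only.
        @Override
        public Statistics catalogStatistics() {
            return new Statistics() {
                @Override public OptionalLong sizeInBytes() {
                    return OptionalLong.of(entry.totalBytes);
                }
                @Override public OptionalLong numRows() {
                    return OptionalLong.of(entry.rowCount);
                }
            };
        }
    }

    public static void main(String[] args) {
        CatalogBackedTable table =
            new CatalogBackedTable(new CatalogEntry("db.events", 1_000_000L, 5_000L));

        Statistics stats = table.catalogStatistics();
        System.out.println(stats.numRows().getAsLong());     // 5000
        System.out.println(stats.sizeInBytes().getAsLong()); // 1000000
        // Reading catalog stats did not trigger any simulated file listing.
        System.out.println(table.fileListings());            // 0
    }
}
```

A real connector would also have to decide how fresh the catalog metadata needs to be; this sketch assumes it is already loaded and current.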
### Does this PR introduce _any_ user-facing change?
No. The new interface is additive and no Spark built-in `Table` implements
it in this PR. Existing connectors and queries observe no behavior change.
### How was this patch tested?
No new tests. The `computeStats` change is additive and preserves the
existing Scan-based path verbatim; existing DSv2 suites (`DataSourceV2Suite`,
`DataSourceV2SQLSuite`) continue to exercise it. Follow-up PRs that add an
in-tree implementation of the mixin will add targeted coverage for the new path.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]