adriangb commented on code in PR #21996:
URL: https://github.com/apache/datafusion/pull/21996#discussion_r3178156016
##########
datafusion/datasource/src/mod.rs:
##########
@@ -138,6 +139,19 @@ pub struct PartitionedFile {
/// When set via [`Self::with_statistics`], partition column statistics are automatically
/// computed from [`Self::partition_values`] with exact min/max/null_count/distinct_count.
pub statistics: Option<Arc<Statistics>>,
+ /// Sparse, request-keyed stats answered by the provider for this file.
+ ///
+ /// Only entries for non-`Absent` answers are present, so memory scales
+ /// with the *count of stats actually requested* rather than the table's
+ /// column count. Used in tandem with — not in place of —
+ /// [`Self::statistics`]: existing consumers that read the dense
+ /// `Statistics` keep working; new consumers (e.g.
+ /// [`datafusion_pruning::FilePruner`]) prefer this sparse map when it's
+ /// populated. Providers that store stats out-of-band (Delta/Iceberg/Hudi
+ /// manifests, Hive Metastore, custom catalogs) can populate this
+ /// directly without rebuilding a full dense `Statistics`.
+ pub satisfied_stats:
+ Option<Arc<std::collections::HashMap<StatisticsRequest, StatisticsValue>>>,
Review Comment:
A type alias for this type might be nice.
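
Something along these lines, as a rough sketch. The alias name `SatisfiedStats`, the helper `satisfied_stats_from`, and the struct `PartitionedFileSketch` are made up for illustration, and the two enums are stand-ins because the real `StatisticsRequest` / `StatisticsValue` definitions are not shown in this thread:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the PR's `StatisticsRequest`; the variants are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    RowCount,
    NullCount { column: String },
}

// Stand-in for the PR's `StatisticsValue`; the variants are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum StatisticsValue {
    Exact(u64),
    Inexact(u64),
}

/// Hypothetical alias: sparse, request-keyed stats shared behind an `Arc`.
pub type SatisfiedStats = Arc<HashMap<StatisticsRequest, StatisticsValue>>;

/// Simplified stand-in for `PartitionedFile`, showing only the new field.
pub struct PartitionedFileSketch {
    pub satisfied_stats: Option<SatisfiedStats>,
}

/// Hypothetical helper: collect (request, value) pairs into the shared map.
pub fn satisfied_stats_from(
    entries: impl IntoIterator<Item = (StatisticsRequest, StatisticsValue)>,
) -> SatisfiedStats {
    Arc::new(entries.into_iter().collect())
}
```

The field declaration then fits on one line, and signatures that pass the map around can name `SatisfiedStats` instead of repeating the nested generic.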
##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -583,7 +594,24 @@ impl TableProvider for ListingTable {
)
.await?;
- Ok(ScanResult::new(plan))
+ // Answer any requested stats from the table-level metadata we
+ // already touched. Anything not derivable from the dense
+ // `Statistics` we computed comes back as `Absent`. Skipped
+ // entirely when the caller didn't ask. We also skip when
+ // `collect_statistics=false` — the contract is "answer what's
+ // free", and computing stats here just to populate this map
+ // would violate that.
Review Comment:
Long term, I think it'd be good to get rid of the dense statistics (or, as a
first step, only create them ephemerally when we read them from the footer), but
that has to wait until there are no more consumers. For now it seemed easier
to derive the sparse stats from the dense ones.
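
To make the "answer what's free" idea concrete, here is a simplified sketch of deriving sparse answers from dense stats. None of this is the PR's actual code: `DenseStatistics`, `answer_requests`, and the `StatisticsRequest` variants are hypothetical stand-ins, and values are plain `u64` rather than a real `StatisticsValue`:

```rust
use std::collections::HashMap;

// Stand-in for a dense statistics record with one slot per table column.
#[derive(Debug, Clone, Default)]
pub struct DenseStatistics {
    pub num_rows: Option<u64>,
    pub null_counts: Vec<Option<u64>>,
}

// Stand-in request key; the variants are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    RowCount,
    NullCount { column: usize },
}

/// Answer only what the dense stats already hold.
///
/// Returns `None` when nothing was requested or when no dense stats were
/// collected. Requests that cannot be answered are left out of the map,
/// which plays the role of `Absent`.
pub fn answer_requests(
    requests: &[StatisticsRequest],
    dense: Option<&DenseStatistics>,
) -> Option<HashMap<StatisticsRequest, u64>> {
    if requests.is_empty() {
        return None;
    }
    let dense = dense?;
    let mut answered = HashMap::new();
    for request in requests {
        let value = match request {
            StatisticsRequest::RowCount => dense.num_rows,
            StatisticsRequest::NullCount { column } => {
                // Out-of-range columns and unknown counts both yield `None`.
                dense.null_counts.get(*column).copied().flatten()
            }
        };
        if let Some(v) = value {
            answered.insert(request.clone(), v);
        }
    }
    Some(answered)
}
```

The map only grows with the number of requests that could be answered, not with the table's column count, and no work happens when the caller asked for nothing or stats were never collected.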
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]