Hey everyone,
I'm starting a thread to connect folks interested in improving the existing
way of collecting column-level statistics (often referred to as *metrics*
in the code). I've already started a proposal, which can be found at
https://s.apache.org/iceberg-column-stats.
*Motivation*
Column statistics are currently stored as a mapping of field id to values
across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
This storage model has critical limitations as the number of columns
increases and as new types are being added to Iceberg:
-
Inefficient Storage due to map-based structure:
-
Large memory overhead during planning/processing
-
Inability to project specific stats (e.g., only null_value_counts for
column X)
-
Type Erasure: Original logical/physical types are lost when stored as
binary blobs, causing:
-
Lossy type inference during reads
- Schema evolution challenges (e.g., widening types)
- Rigid Schema: Stats are tied to the data_fil entry record, limiting
extensibility for new stats.
*Goals*
Improve the column stats representation to allow for the following:
-
Projectability: Enable independent access to specific stats (e.g.,
lower_bounds without loading upper_bounds).
-
Type Preservation: Store original data types to support accurate reads
and schema evolution.
-
Flexible/Extensible Representation: Allow per-field stats structures
(e.g., complex types like Geo/Variant).
Thanks
Eduard