Hey everyone,

I've just added an event to the dev calendar for July 15 at 9am (PT) to
discuss the column stats proposal.


Eduard

On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:

> +1 for the wonderful feature. Please count me in if you need any help.
>
> Gábor Kaszab <gaborkas...@apache.org> wrote (Mon, Jul 7, 2025, 21:22):
> >
> > +1 Seems a great improvement! Let me know if I can help out with
> implementation, measurements, etc.!
> >
> > Regards,
> > Gabor Kaszab
> >
> > John Zhuge <jzh...@apache.org> wrote (Thu, Jun 5, 2025, 23:41):
> >>
> >> +1 Looking forward to this feature
> >>
> >> John Zhuge
> >>
> >>
> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
> >>>
> >>> > I think it does not make sense to stick with Avro for manifest files
> if we break column stats into sub-fields.
> >>>
> >>> This isn't necessarily true. Avro can benefit from better pushdown
> with Eduard's approach as well by being able to skip more efficiently. With
> the current layout, Avro stores a list of key/value pairs that are all
> projected and put into a map. We avoid decoding the values, but each field
> ID is decoded, then the length of the value is decoded, and finally there
> is a put operation with an ID and value ByteBuffer pair. With the new
> approach, we will be able to know which fields are relevant and skip
> unprojected fields based on the file schema, which we couldn't do before.
> >>>
> >>> To skip stats for an unused field (not part of the filter), there are
> two cases. Lower/upper bounds for types that are fixed width are skipped by
> updating the read position. And bounds for types that are variable length
> (strings and binary) are skipped by reading the length and skipping that
> number of bytes.
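
The two skip cases above can be sketched roughly as follows. This is an illustrative sketch only, not Iceberg's actual reader code, and it simplifies the encoding: variable-length values here use a 4-byte length prefix, whereas Avro encodes lengths as zig-zag varints.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the two skip cases: fixed-width stats are skipped
// by advancing the read position; variable-length stats are skipped by
// reading the length and jumping past that many bytes.
public class StatsSkipper {
  // Fixed-width bounds (ints, longs, doubles, ...): skip by advancing the
  // position by the known width.
  static void skipFixed(ByteBuffer buf, int width) {
    buf.position(buf.position() + width);
  }

  // Variable-length bounds (strings, binary): read the length, then skip
  // that many bytes. (4-byte length prefix assumed here for simplicity.)
  static void skipVariable(ByteBuffer buf) {
    int length = buf.getInt();
    buf.position(buf.position() + length);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(42L);               // a fixed-width stat we don't need
    buf.putInt(3);                  // length prefix of a variable-length stat
    buf.put(new byte[] {1, 2, 3});  // the variable-length bytes
    buf.flip();

    skipFixed(buf, 8);   // position jumps past the 8-byte long
    skipVariable(buf);   // position jumps past the length + 3 bytes
    System.out.println(buf.position()); // prints 15
  }
}
```

Either way the bytes may still have to be read from storage, but no per-field decode or map `put` is needed for unprojected fields.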
> >>>
> >>> It turns out that actually producing the metric maps is a fairly
> expensive operation, so being able to skip metrics more quickly even if the
> bytes still have to be read is going to save time. That said, using a
> columnar format is still going to be a good idea!
> >>>
> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
> >>>>
> >>>> > Together with the change which allows storing metadata in columnar
> formats
> >>>>
> >>>> +1 on this. I think it does not make sense to stick with Avro for
> manifest files if we break column stats into sub-fields.
> >>>>
> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <
> peter.vary.apa...@gmail.com> wrote:
> >>>>>
> >>>>> I would love to see more flexibility in file stats. Together with
> the change that allows storing metadata in columnar formats, this will open
> up many new possibilities: Bloom filters in metadata that could be used for
> filtering out files, HLL sketches, etc.
> >>>>>
> >>>>> +1 for the change
> >>>>>
> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> +1, excited for this one too. We've seen the current metrics maps
> blow up memory and hope this can improve that.
> >>>>>>
> >>>>>> On the Geo front, this could allow us to add supplementary metrics
> that don't conform to the geo type, like S2 Cell Ids.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Szehon
> >>>>>>
> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hey everyone,
> >>>>>>>
> >>>>>>> I'm starting a thread to connect folks interested in improving the
> existing way of collecting column-level statistics (often referred to as
> metrics in the code). I've already started a proposal, which can be found
> at https://s.apache.org/iceberg-column-stats.
> >>>>>>>
> >>>>>>> Motivation
> >>>>>>>
> >>>>>>> Column statistics are currently stored as a mapping of field id to
> values across multiple columns (lower/upper bounds, value/nan/null counts,
> sizes). This storage model has critical limitations as the number of
> columns increases and as new types are added to Iceberg:
> >>>>>>>
> >>>>>>> Inefficient Storage due to map-based structure:
> >>>>>>>
> >>>>>>> - Large memory overhead during planning/processing
> >>>>>>>
> >>>>>>> - Inability to project specific stats (e.g., only null_value_counts
> for column X)
> >>>>>>>
> >>>>>>> Type Erasure: Original logical/physical types are lost when stored
> as binary blobs, causing:
> >>>>>>>
> >>>>>>> - Lossy type inference during reads
> >>>>>>>
> >>>>>>> - Schema evolution challenges (e.g., widening types)
> >>>>>>>
> >>>>>>> Rigid Schema: Stats are tied to the data_file entry record,
> limiting extensibility for new stats.
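
As a rough illustration of the contrast described above: the map-based shape mirrors Iceberg's existing stats maps (field id to value, with bounds stored as type-erased binary), while the per-column struct below is a hypothetical sketch for illustration only, not the proposal's actual schema.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class StatsLayouts {
  // Current layout: one map per stat, keyed by field id; bounds are
  // type-erased binary blobs.
  static class MapBasedStats {
    Map<Integer, Long> nullValueCounts = new HashMap<>();
    Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
    Map<Integer, ByteBuffer> upperBounds = new HashMap<>();
  }

  // Hypothetical per-column struct: stats can be projected independently
  // and bounds keep their original type (here, an int column).
  static class ColumnStats {
    long nullValueCount;
    Integer lowerBound;
    Integer upperBound;
  }

  public static void main(String[] args) {
    MapBasedStats current = new MapBasedStats();
    current.nullValueCounts.put(1, 0L);
    // The bound is serialized to bytes, so its original type is lost:
    current.lowerBounds.put(1, ByteBuffer.allocate(4).putInt(0, 5));

    ColumnStats proposed = new ColumnStats();
    proposed.nullValueCount = 0;
    proposed.lowerBound = 5; // type preserved
    proposed.upperBound = 9;

    // Reading a bound back from the map requires knowing its type out-of-band:
    System.out.println(current.lowerBounds.get(1).getInt(0)); // prints 5
    System.out.println(proposed.lowerBound);                  // prints 5
  }
}
```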
> >>>>>>>
> >>>>>>>
> >>>>>>> Goals
> >>>>>>>
> >>>>>>> Improve the column stats representation to allow for the following:
> >>>>>>>
> >>>>>>> - Projectability: Enable independent access to specific stats
> (e.g., lower_bounds without loading upper_bounds).
> >>>>>>>
> >>>>>>> - Type Preservation: Store original data types to support accurate
> reads and schema evolution.
> >>>>>>>
> >>>>>>> - Flexible/Extensible Representation: Allow per-field stats
> structures (e.g., complex types like Geo/Variant).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Eduard
>
