Hey everyone, I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
Eduard

On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
> +1 for the wonderful feature. Please count me in if you need any help.
>
> Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025, 21:22:
> >
> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
> >
> > Regards,
> > Gabor Kaszab
> >
> > John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025, 23:41:
> >>
> >> +1 Looking forward to this feature
> >>
> >> John Zhuge
> >>
> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
> >>>
> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
> >>>
> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
> >>>
> >>> To skip stats for an unused field (one not part of the filter), there are two cases: lower/upper bounds for fixed-width types are skipped by updating the read position, and bounds for variable-length types (strings and binary) are skipped by reading the length and then skipping that number of bytes.
> >>>
> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly, even if the bytes still have to be read, is going to save time. That said, using a columnar format is still going to be a good idea!
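[Editor's note: the two skipping cases Ryan describes (fixed-width bounds skipped by advancing the read position, variable-length bounds skipped via a length prefix) can be sketched with a toy flat encoding. This is illustrative only — it is not Iceberg's actual Avro layout, and the type codes and widths are invented for the example:]

```python
import struct

# Hypothetical flat encoding of per-field lower bounds, for illustration only
# (not Iceberg's actual Avro layout). Each entry is:
#   field_id (4 bytes) | type_code (1 byte) | value
# Fixed-width types store the value directly; variable-length types
# (strings/binary) store a 4-byte length followed by that many bytes.

FIXED_WIDTHS = {0: 4, 1: 8}   # type_code -> width (e.g. int32, int64)
VARLEN = 2                    # type_code for strings/binary

def scan_bounds(buf, wanted_ids):
    """Decode only the bounds for `wanted_ids`, skipping everything else."""
    bounds, pos = {}, 0
    while pos < len(buf):
        field_id, type_code = struct.unpack_from("<iB", buf, pos)
        pos += 5
        if type_code == VARLEN:
            (length,) = struct.unpack_from("<i", buf, pos)
            pos += 4
            if field_id in wanted_ids:
                bounds[field_id] = buf[pos:pos + length]
            pos += length          # variable length: read length, skip the bytes
        else:
            width = FIXED_WIDTHS[type_code]
            if field_id in wanted_ids:
                bounds[field_id] = buf[pos:pos + width]
            pos += width           # fixed width: skip by updating the position
    return bounds
```

Unwanted fields cost only a position update (plus one length read for variable-length values), instead of a decode and a map put per field as in the current map-based layout.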
> >>>
> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
> >>>>
> >>>> > Together with the change which allows storing metadata in columnar formats
> >>>>
> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
> >>>>
> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>
> >>>>> I would love to see more flexibility in file stats. Together with the change that allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata that could be used for filtering out files, HLL sketches, etc.
> >>>>>
> >>>>> +1 for the change
> >>>>>
> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
> >>>>>>
> >>>>>> +1, excited for this one too. We've seen the current metrics maps blow up memory and hope this can improve that.
> >>>>>>
> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell IDs.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Szehon
> >>>>>>
> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hey everyone,
> >>>>>>>
> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
> >>>>>>>
> >>>>>>> Motivation
> >>>>>>>
> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
> >>>>>>> This storage model has critical limitations as the number of columns increases and as new types are added to Iceberg:
> >>>>>>>
> >>>>>>> Inefficient Storage due to map-based structure:
> >>>>>>>
> >>>>>>> - Large memory overhead during planning/processing
> >>>>>>>
> >>>>>>> - Inability to project specific stats (e.g., only null_value_counts for column X)
> >>>>>>>
> >>>>>>> Type Erasure: original logical/physical types are lost when stored as binary blobs, causing:
> >>>>>>>
> >>>>>>> - Lossy type inference during reads
> >>>>>>>
> >>>>>>> - Schema evolution challenges (e.g., widening types)
> >>>>>>>
> >>>>>>> Rigid Schema: stats are tied to the data_file entry record, limiting extensibility for new stats.
> >>>>>>>
> >>>>>>> Goals
> >>>>>>>
> >>>>>>> Improve the column stats representation to allow for the following:
> >>>>>>>
> >>>>>>> Projectability: enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
> >>>>>>>
> >>>>>>> Type Preservation: store original data types to support accurate reads and schema evolution.
> >>>>>>>
> >>>>>>> Flexible/Extensible Representation: allow per-field stats structures (e.g., complex types like Geo/Variant).
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Eduard
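[Editor's note: the motivation and goals above can be made concrete with a toy sketch contrasting today's map-based stats with a per-field struct. The names, struct layout, and type strings here are illustrative assumptions, not the proposal's actual schema:]

```python
from dataclasses import dataclass
from typing import Optional

# Today (simplified): each stat is a map of field id -> value, so reading any
# one stat materializes entries for every column, and bounds are opaque
# binary blobs whose original logical type is lost.
current_stats = {
    "null_value_counts": {1: 0, 2: 7},
    "lower_bounds": {1: b"\x01\x00\x00\x00", 2: b"a"},  # types erased
    "upper_bounds": {1: b"\x63\x00\x00\x00", 2: b"z"},
}

# Sketch of the proposed direction: one typed struct per column, so readers
# can project a single stat for a single field and the logical type survives.
@dataclass
class ColumnStats:
    type: str                            # original logical type, preserved
    null_value_count: Optional[int] = None
    lower_bound: Optional[object] = None  # stored with its real type
    upper_bound: Optional[object] = None

per_field_stats = {
    1: ColumnStats(type="int", null_value_count=0, lower_bound=1, upper_bound=99),
    2: ColumnStats(type="string", null_value_count=7, lower_bound="a", upper_bound="z"),
}

# Projecting "only null_value_count for column 2" touches one struct field,
# without decoding any bounds or any other column's stats:
projected = per_field_stats[2].null_value_count
```

With this shape, schema evolution (e.g., widening an int bound to long) can consult the preserved type instead of guessing from raw bytes, and new per-field stats can be added without changing the shared entry record.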