Re: [DISCUSS] v4 - Improved column statistics

Eduard Tudenhöfner Tue, 15 Jul 2025 22:53:38 -0700

Hey everyone,

We met yesterday and talked about the column stats proposal.
Please find the recording here
<https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing>
and the notes here
<https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>
.


Thanks everyone,
Eduard

On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org>
wrote:

> Hey everyone,
>
> I've just added an event to the dev calendar for July 15 at 9am (PT) to
> discuss the column stats proposal.
>
>
> Eduard
>
> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>
>> +1 for the wonderful feature. Please count me in if you need any help.
>>
>> Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025 at 21:22:
>> >
>> > +1 Seems a great improvement! Let me know if I can help out with
>> implementation, measurements, etc.!
>> >
>> > Regards,
>> > Gabor Kaszab
>> >
>> > John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025 at 23:41:
>> >>
>> >> +1 Looking forward to this feature
>> >>
>> >> John Zhuge
>> >>
>> >>
>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>> >>>
>> >>> > I think it does not make sense to stick manifest files to Avro if
>> we break column stats into sub fields.
>> >>>
>> >>> This isn't necessarily true. Avro can benefit from better pushdown
>> with Eduard's approach as well by being able to skip more efficiently. With
>> the current layout, Avro stores a list of key/value pairs that are all
>> projected and put into a map. We avoid decoding the values, but each field
>> ID is decoded, then the length of the value is decoded, and finally there
>> is a put operation with an ID and value ByteBuffer pair. With the new
>> approach, we will be able to know which fields are relevant and skip
>> unprojected fields based on the file schema, which we couldn't do before.
>> >>>
>> >>> To skip stats for an unused field (not part of the filter), there are
>> two cases. Lower/upper bounds for types that are fixed width are skipped by
>> updating the read position. And bounds for types that are variable length
>> (strings and binary) are skipped by reading the length and skipping that
>> number of bytes.
>> >>>
>> >>> It turns out that actually producing the metric maps is a fairly
>> expensive operation, so being able to skip metrics more quickly even if the
>> bytes still have to be read is going to save time. That said, using a
>> columnar format is still going to be a good idea!
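
[Editorial sketch, not part of the original message.] The two skip cases described above can be illustrated roughly as follows. This is not code from Iceberg or the Avro library. It assumes a hypothetical file schema of six bounds fields written in order, using Avro-style encoding: doubles as 8 little-endian bytes, strings as a zigzag varint length followed by UTF-8 bytes. All names here (SCHEMA, encode, decode, skip_fixed, skip_var) are made up for illustration.

```python
import io
import struct

# Hypothetical file schema: (field name, kind) in write order.
SCHEMA = [
    ("price.lower", "double"),
    ("price.upper", "double"),
    ("name.lower", "string"),
    ("name.upper", "string"),
    ("score.lower", "double"),
    ("score.upper", "double"),
]


def write_long(out, n):
    """Write a signed integer as an Avro-style zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))


def read_long(buf):
    """Read an Avro-style zigzag varint."""
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def skip_fixed(buf, width):
    """Fixed-width value: just move the read position."""
    buf.seek(width, io.SEEK_CUR)


def skip_var(buf):
    """Variable-length value: read the length, then skip that many bytes."""
    buf.seek(read_long(buf), io.SEEK_CUR)


def encode(row):
    out = io.BytesIO()
    for name, kind in SCHEMA:
        if kind == "double":
            out.write(struct.pack("<d", row[name]))
        else:
            data = row[name].encode("utf-8")
            write_long(out, len(data))
            out.write(data)
    return out.getvalue()


def decode(data, projected):
    """Decode only the projected fields; skip everything else."""
    buf = io.BytesIO(data)
    result = {}
    for name, kind in SCHEMA:
        if name not in projected:
            skip_fixed(buf, 8) if kind == "double" else skip_var(buf)
        elif kind == "double":
            result[name] = struct.unpack("<d", buf.read(8))[0]
        else:
            result[name] = buf.read(read_long(buf)).decode("utf-8")
    return result


row = {
    "price.lower": 1.25, "price.upper": 99.0,
    "name.lower": "apple", "name.upper": "pear",
    "score.lower": 0.5, "score.upper": 9.5,
}
wanted = decode(encode(row), {"score.lower", "score.upper"})
```

Here the reader never builds a map entry for the price or name bounds. The bytes are still read from storage, but no values are materialized for unprojected fields.
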
>> >>>
>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>> >>>>
>> >>>> > Together with the change which allows storing metadata in columnar
>> formats
>> >>>>
>> >>>> +1 on this. I think it does not make sense to stick manifest files
>> to Avro if we break column stats into sub fields.
>> >>>>
>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <
>> peter.vary.apa...@gmail.com> wrote:
>> >>>>>
>> >>>>> I would love to see more flexibility in file stats. Together with
>> the change which allows storing metadata in columnar formats, this will open
>> up many new possibilities: Bloom filters in metadata which could be used for
>> filtering out files, HLL sketches, etc.
>> >>>>>
>> >>>>> +1 for the change
>> >>>>>
>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>> +1, excited for this one too. We've seen the current metrics maps
>> blow up memory and hope this can improve that.
>> >>>>>>
>> >>>>>> On the Geo front, this could allow us to add supplementary metrics
>> that don't conform to the geo type, like S2 Cell Ids.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Szehon
>> >>>>>>
>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>> >>>>>>>
>> >>>>>>> Hey everyone,
>> >>>>>>>
>> >>>>>>> I'm starting a thread to connect folks interested in improving
>> the existing way of collecting column-level statistics (often referred to
>> as metrics in the code). I've already started a proposal, which can be
>> found at https://s.apache.org/iceberg-column-stats.
>> >>>>>>>
>> >>>>>>> Motivation
>> >>>>>>>
>> >>>>>>> Column statistics are currently stored as a mapping of field id
>> to values across multiple columns (lower/upper bounds, value/nan/null
>> counts, sizes). This storage model has critical limitations as the number
>> of columns increases and as new types are being added to Iceberg:
>> >>>>>>>
>> >>>>>>> Inefficient Storage due to map-based structure:
>> >>>>>>>
>> >>>>>>> Large memory overhead during planning/processing
>> >>>>>>>
>> >>>>>>> Inability to project specific stats (e.g., only null_value_counts
>> for column X)
>> >>>>>>>
>> >>>>>>> Type Erasure: Original logical/physical types are lost when
>> stored as binary blobs, causing:
>> >>>>>>>
>> >>>>>>> Lossy type inference during reads
>> >>>>>>>
>> >>>>>>> Schema evolution challenges (e.g., widening types)
>> >>>>>>>
>> >>>>>>> Rigid Schema: Stats are tied to the data_file entry record,
>> limiting extensibility for new stats.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Goals
>> >>>>>>>
>> >>>>>>> Improve the column stats representation to allow for the
>> following:
>> >>>>>>>
>> >>>>>>> Projectability: Enable independent access to specific stats
>> (e.g., lower_bounds without loading upper_bounds).
>> >>>>>>>
>> >>>>>>> Type Preservation: Store original data types to support accurate
>> reads and schema evolution.
>> >>>>>>>
>> >>>>>>> Flexible/Extensible Representation: Allow per-field stats
>> structures (e.g., complex types like Geo/Variant).
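
[Editorial sketch, not part of the original message.] To make the contrast between the map-based model and a per-field representation concrete, here is a minimal illustration. It is not the layout from the proposal document. The FieldStats class, the DECODERS table, and the sample values are all hypothetical. The only assumption taken from Iceberg itself is that bounds are serialized as raw bytes (8-byte little-endian for long, plain UTF-8 for string), which is why the maps alone cannot say what type a bound has.

```python
import struct
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

# Map-based model: one map per statistic, keyed by field id.
# Bounds are opaque bytes, so the value type is not recoverable from the maps.
legacy_maps = {
    "value_counts": {1: 100, 2: 100},
    "null_value_counts": {1: 0, 2: 7},
    "lower_bounds": {1: struct.pack("<q", 3), 2: b"apple"},
    "upper_bounds": {1: struct.pack("<q", 42), 2: b"pear"},
}


@dataclass
class FieldStats(Generic[T]):
    """Hypothetical per-field struct; bounds keep their original type."""
    value_count: int
    null_value_count: int
    lower_bound: Optional[T] = None
    upper_bound: Optional[T] = None


# Hypothetical decoders: the type name must come from the table schema.
DECODERS = {
    "long": lambda b: struct.unpack("<q", b)[0],
    "string": lambda b: b.decode("utf-8"),
}


def to_field_stats(maps, field_id, type_name):
    """Gather one field's stats from the maps into a typed struct."""
    decode = DECODERS[type_name]
    lower = maps["lower_bounds"].get(field_id)
    upper = maps["upper_bounds"].get(field_id)
    return FieldStats(
        value_count=maps["value_counts"][field_id],
        null_value_count=maps["null_value_counts"][field_id],
        lower_bound=decode(lower) if lower is not None else None,
        upper_bound=decode(upper) if upper is not None else None,
    )


per_field = {
    1: to_field_stats(legacy_maps, 1, "long"),
    2: to_field_stats(legacy_maps, 2, "string"),
}

# Projecting one stat for one column touches a single attribute
# instead of four full maps.
null_count_col_2 = per_field[2].null_value_count
```

With a per-field struct, a reader that only needs null counts for column 2 can ask for exactly that path, and a columnar manifest format could then avoid reading the other stats at all.
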
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks
>> >>>>>>> Eduard
>>
>
