Hi Team,

Sorry I was not able to join the discussion on Tuesday :(, but I listened to the recording.
A few thoughts:

- I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space, and I don't think we would like to create hard limits here.
- If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, since engines that don't know a specific stats_type could just ignore it.

Thanks Eduard and Dan for driving this!
Peter

Eduard Tudenhöfner <etudenhoef...@apache.org> wrote (on Wed, Jul 16, 2025, 7:53):

> Hey everyone,
>
> We met yesterday and talked about the column stats proposal.
> Please find the recording here
> <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing>
> and the notes here
> <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>
> Thanks everyone,
> Eduard
>
> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>
>> Hey everyone,
>>
>> I've just added an event to the dev calendar for July 15 at 9am (PT) to
>> discuss the column stats proposal.
>>
>> Eduard
>>
>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>
>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>
>>> Gábor Kaszab <gaborkas...@apache.org> wrote (on Mon, Jul 7, 2025, 21:22):
>>> >
>>> > +1 Seems a great improvement! Let me know if I can help out with
>>> > implementation, measurements, etc.!
>>> >
>>> > Regards,
>>> > Gabor Kaszab
>>> >
>>> > John Zhuge <jzh...@apache.org> wrote (on 2025 Jun
5, Thu, 23:41):
>>> >>
>>> >> +1 Looking forward to this feature
>>> >>
>>> >> John Zhuge
>>> >>
>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>> >>>
>>> >>> > I think it does not make sense to stick manifest files to Avro if
>>> >>> > we break column stats into sub fields.
>>> >>>
>>> >>> This isn't necessarily true. Avro can benefit from better pushdown
>>> >>> with Eduard's approach as well by being able to skip more efficiently. With
>>> >>> the current layout, Avro stores a list of key/value pairs that are all
>>> >>> projected and put into a map. We avoid decoding the values, but each field
>>> >>> ID is decoded, then the length of the value is decoded, and finally there
>>> >>> is a put operation with an ID and value ByteBuffer pair. With the new
>>> >>> approach, we will be able to know which fields are relevant and skip
>>> >>> unprojected fields based on the file schema, which we couldn't do before.
>>> >>>
>>> >>> To skip stats for an unused field (not part of the filter), there
>>> >>> are two cases. Lower/upper bounds for types that are fixed width are
>>> >>> skipped by updating the read position. And bounds for types that are
>>> >>> variable length (strings and binary) are skipped by reading the length and
>>> >>> skipping that number of bytes.
>>> >>>
>>> >>> It turns out that actually producing the metric maps is a fairly
>>> >>> expensive operation, so being able to skip metrics more quickly even if the
>>> >>> bytes still have to be read is going to save time. That said, using a
>>> >>> columnar format is still going to be a good idea!
>>> >>>
>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>> >>>>
>>> >>>> > Together with the change which allows storing metadata in
>>> >>>> > columnar formats
>>> >>>>
>>> >>>> +1 on this. I think it does not make sense to stick manifest files
>>> >>>> to Avro if we break column stats into sub fields.
>>> >>>>
>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> I would love to see more flexibility in file stats. Together with
>>> >>>>> the change which allows storing metadata in columnar formats, this will open up
>>> >>>>> many new possibilities: Bloom filters in metadata which could be used for
>>> >>>>> filtering out files, HLL sketches, etc.
>>> >>>>>
>>> >>>>> +1 for the change
>>> >>>>>
>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> +1, excited for this one too. We've seen the current metrics
>>> >>>>>> maps blow up the memory and hope we can improve that.
>>> >>>>>>
>>> >>>>>> On the Geo front, this could allow us to add supplementary
>>> >>>>>> metrics that don't conform to the geo type, like S2 Cell Ids.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Szehon
>>> >>>>>>
>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>> >>>>>>>
>>> >>>>>>> Hey everyone,
>>> >>>>>>>
>>> >>>>>>> I'm starting a thread to connect folks interested in improving
>>> >>>>>>> the existing way of collecting column-level statistics (often referred to
>>> >>>>>>> as metrics in the code). I've already started a proposal, which can be
>>> >>>>>>> found at https://s.apache.org/iceberg-column-stats.
>>> >>>>>>>
>>> >>>>>>> Motivation
>>> >>>>>>>
>>> >>>>>>> Column statistics are currently stored as a mapping of field id
>>> >>>>>>> to values across multiple columns (lower/upper bounds, value/nan/null
>>> >>>>>>> counts, sizes).
>>> >>>>>>> This storage model has critical limitations as the number
>>> >>>>>>> of columns increases and as new types are added to Iceberg:
>>> >>>>>>>
>>> >>>>>>> Inefficient storage due to the map-based structure:
>>> >>>>>>> - Large memory overhead during planning/processing
>>> >>>>>>> - Inability to project specific stats (e.g., only null_value_counts for column X)
>>> >>>>>>>
>>> >>>>>>> Type erasure: original logical/physical types are lost when stats are stored as binary blobs, causing:
>>> >>>>>>> - Lossy type inference during reads
>>> >>>>>>> - Schema evolution challenges (e.g., widening types)
>>> >>>>>>>
>>> >>>>>>> Rigid schema: stats are tied to the data_file entry record, limiting extensibility for new stats.
>>> >>>>>>>
>>> >>>>>>> Goals
>>> >>>>>>>
>>> >>>>>>> Improve the column stats representation to allow for the following:
>>> >>>>>>>
>>> >>>>>>> - Projectability: enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>> >>>>>>> - Type preservation: store original data types to support accurate reads and schema evolution.
>>> >>>>>>> - Flexible/extensible representation: allow per-field stats structures (e.g., complex types like Geo/Variant).
>>> >>>>>>>
>>> >>>>>>> Thanks
>>> >>>>>>> Eduard
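P.S. To make the id-space arithmetic in my first point concrete, here is a rough sketch. Only the `10000 + 200 * data_field_id` formula comes from the current proposal; the helper names and the dense-assignment scheme are hypothetical illustrations of the metadata-based alternative (starting_id, stats_types_present, columns_with_stats), not an actual implementation:

```python
# Hypothetical sketch; only the 10000 + 200 * data_field_id formula is from
# the proposal under discussion, everything else is illustrative.

STATS_BASE = 10_000
IDS_PER_COLUMN = 200

def current_stat_id(data_field_id: int, stat_offset: int) -> int:
    """Current proposal: every data column reserves a fixed block of 200 ids,
    whether or not it actually carries stats."""
    return STATS_BASE + IDS_PER_COLUMN * data_field_id + stat_offset

def dense_stat_id(starting_id: int, columns_with_stats: list[int],
                  stats_types_present: list[str],
                  data_field_id: int, stat_type: str) -> int:
    """Alternative sketch: metadata records starting_id, which columns actually
    have stats, and which stat types are present, so ids are assigned densely
    and engines can ignore stat types they don't recognize."""
    col_index = columns_with_stats.index(data_field_id)
    stat_index = stats_types_present.index(stat_type)
    return starting_id + col_index * len(stats_types_present) + stat_index

# With 10,000 columns the fixed scheme already reserves ids up to ~2 million:
print(current_stat_id(10_000, 0))  # 2010000

# Dense scheme: 3 columns with stats x 2 stat types -> only 6 ids used.
print(dense_stat_id(10_000, [1, 5, 42], ["lower_bound", "null_count"],
                    42, "null_count"))  # 10005
```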