Hey everyone, I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
Eduard

On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
> +1 for the wonderful feature. Please count me in if you need any help.
>
> Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025, 21:22:
> >
> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
> >
> > Regards,
> > Gabor Kaszab
> >
> > John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025, 23:41:
> >>
> >> +1 Looking forward to this feature
> >>
> >> John Zhuge
> >>
> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
> >>>
> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
> >>>
> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
> >>>
> >>> To skip stats for an unused field (one not part of the filter), there are two cases: lower/upper bounds for fixed-width types are skipped by updating the read position, and bounds for variable-length types (strings and binary) are skipped by reading the length and then skipping that number of bytes.
> >>>
> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly, even if the bytes still have to be read, is going to save time. That said, using a columnar format is still going to be a good idea!
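[Editor's note: the two skipping cases Ryan describes (fixed-width bounds skipped by advancing the read position, variable-length bounds skipped via a length prefix) can be sketched with a toy flat encoding. This is illustrative only — it is not Iceberg's actual Avro layout, and the type codes and widths are invented for the example:]

```python
import struct

# Hypothetical flat encoding of per-field lower bounds, for illustration only
# (not Iceberg's actual Avro layout). Each entry is:
#   field_id (4 bytes) | type_code (1 byte) | value
# Fixed-width types store the value directly; variable-length types
# (strings/binary) store a 4-byte length followed by that many bytes.

FIXED_WIDTHS = {0: 4, 1: 8}   # type_code -> width (e.g. int32, int64)
VARLEN = 2                    # type_code for strings/binary

def scan_bounds(buf, wanted_ids):
    """Decode only the bounds for `wanted_ids`, skipping everything else."""
    bounds, pos = {}, 0
    while pos < len(buf):
        field_id, type_code = struct.unpack_from("<iB", buf, pos)
        pos += 5
        if type_code == VARLEN:
            (length,) = struct.unpack_from("<i", buf, pos)
            pos += 4
            if field_id in wanted_ids:
                bounds[field_id] = buf[pos:pos + length]
            pos += length          # variable length: read length, skip the bytes
        else:
            width = FIXED_WIDTHS[type_code]
            if field_id in wanted_ids:
                bounds[field_id] = buf[pos:pos + width]
            pos += width           # fixed width: skip by updating the position
    return bounds
```

Unwanted fields cost only a position update (plus one length read for variable-length values), instead of a decode and a map put per field as in the current map-based layout.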
> >>>
> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
> >>>>
> >>>> > Together with the change which allows storing metadata in columnar formats
> >>>>
> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
> >>>>
> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>
> >>>>> I would love to see more flexibility in file stats. Together with the change that allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata that could be used for filtering out files, HLL sketches, etc.
> >>>>>
> >>>>> +1 for the change
> >>>>>
> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
> >>>>>>
> >>>>>> +1, excited for this one too. We've seen the current metrics maps blow up memory and hope this can improve that.
> >>>>>>
> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell IDs.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Szehon
> >>>>>>
> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hey everyone,
> >>>>>>>
> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
> >>>>>>>
> >>>>>>> Motivation
> >>>>>>>
> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
> >>>>>>> This storage model has critical limitations as the number of columns increases and as new types are added to Iceberg:
> >>>>>>>
> >>>>>>> Inefficient Storage due to map-based structure:
> >>>>>>>
> >>>>>>> - Large memory overhead during planning/processing
> >>>>>>>
> >>>>>>> - Inability to project specific stats (e.g., only null_value_counts for column X)
> >>>>>>>
> >>>>>>> Type Erasure: original logical/physical types are lost when stored as binary blobs, causing:
> >>>>>>>
> >>>>>>> - Lossy type inference during reads
> >>>>>>>
> >>>>>>> - Schema evolution challenges (e.g., widening types)
> >>>>>>>
> >>>>>>> Rigid Schema: stats are tied to the data_file entry record, limiting extensibility for new stats.
> >>>>>>>
> >>>>>>> Goals
> >>>>>>>
> >>>>>>> Improve the column stats representation to allow for the following:
> >>>>>>>
> >>>>>>> Projectability: enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
> >>>>>>>
> >>>>>>> Type Preservation: store original data types to support accurate reads and schema evolution.
> >>>>>>>
> >>>>>>> Flexible/Extensible Representation: allow per-field stats structures (e.g., complex types like Geo/Variant).
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Eduard
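[Editor's note: the motivation and goals above can be made concrete with a toy sketch contrasting today's map-based stats with a per-field struct. The names, struct layout, and type strings here are illustrative assumptions, not the proposal's actual schema:]

```python
from dataclasses import dataclass
from typing import Optional

# Today (simplified): each stat is a map of field id -> value, so reading any
# one stat materializes entries for every column, and bounds are opaque
# binary blobs whose original logical type is lost.
current_stats = {
    "null_value_counts": {1: 0, 2: 7},
    "lower_bounds": {1: b"\x01\x00\x00\x00", 2: b"a"},  # types erased
    "upper_bounds": {1: b"\x63\x00\x00\x00", 2: b"z"},
}

# Sketch of the proposed direction: one typed struct per column, so readers
# can project a single stat for a single field and the logical type survives.
@dataclass
class ColumnStats:
    type: str                            # original logical type, preserved
    null_value_count: Optional[int] = None
    lower_bound: Optional[object] = None  # stored with its real type
    upper_bound: Optional[object] = None

per_field_stats = {
    1: ColumnStats(type="int", null_value_count=0, lower_bound=1, upper_bound=99),
    2: ColumnStats(type="string", null_value_count=7, lower_bound="a", upper_bound="z"),
}

# Projecting "only null_value_count for column 2" touches one struct field,
# without decoding any bounds or any other column's stats:
projected = per_field_stats[2].null_value_count
```

With this shape, schema evolution (e.g., widening an int bound to long) can consult the preserved type instead of guessing from raw bytes, and new per-field stats can be added without changing the shared entry record.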