Hey everyone,

We met yesterday and talked about the column stats proposal. Please find the recording here <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing> and the notes here <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
Thanks everyone,
Eduard

On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:

> Hey everyone,
>
> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>
> Eduard
>
> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>
>> +1 for the wonderful feature. Please count me in if you need any help.
>>
>> On Mon, Jul 7, 2025 at 21:22 Gábor Kaszab <gaborkas...@apache.org> wrote:
>> >
>> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>> >
>> > Regards,
>> > Gabor Kaszab
>> >
>> > On Thu, Jun 5, 2025 at 23:41 John Zhuge <jzh...@apache.org> wrote:
>> >>
>> >> +1 Looking forward to this feature
>> >>
>> >> John Zhuge
>> >>
>> >>
>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>> >>>
>> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>> >>>
>> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>> >>>
>> >>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
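To make the two skip cases Ryan describes concrete, here is a minimal Python sketch. It assumes Avro's binary encoding, where a double is 8 fixed bytes and a string or bytes value is a zigzag varint length followed by that many bytes. The record layout and all function names (read_long, skip_fixed, skip_length_prefixed) are hypothetical and only for illustration; this is not the Iceberg manifest schema or the actual reader code.

```python
import struct


def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer as an Avro zigzag varint (helper for the demo data)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one zigzag varint starting at pos; return (value, new position)."""
    raw = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def skip_fixed(pos: int, width: int) -> int:
    """Case 1: fixed-width value. Just move the read position forward."""
    return pos + width


def skip_length_prefixed(buf: bytes, pos: int) -> int:
    """Case 2: variable-length value. Read the length, then skip that many bytes."""
    length, pos = read_long(buf, pos)
    return pos + length


# Hypothetical toy record: an unused double bound, an unused string bound,
# and then the one field the reader actually wants (a long).
text = "apple".encode("utf-8")
record = struct.pack("<d", 1.5) + encode_long(len(text)) + text + encode_long(42)

pos = 0
pos = skip_fixed(pos, 8)                  # skip the double without decoding it
pos = skip_length_prefixed(record, pos)   # skip the string without copying it
wanted, pos = read_long(record, pos)      # decode only the projected field
```

Neither skipped value is decoded or put into a map here; the reader only advances its position until it reaches the field it needs.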
>> >>>
>> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly even if the bytes still have to be read is going to save time. That said, using a columnar format is still going to be a good idea!
>> >>>
>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>> >>>>
>> >>>> > Together with the change which allows storing metadata in columnar formats
>> >>>>
>> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>> >>>>
>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >>>>>
>> >>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>> >>>>>
>> >>>>> +1 for the change
>> >>>>>
>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> +1, excited for this one too. We've seen the current metrics maps blow up the memory and hope we can improve that.
>> >>>>>>
>> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Szehon
>> >>>>>>
>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>> >>>>>>>
>> >>>>>>> Hey everyone,
>> >>>>>>>
>> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>> >>>>>>>
>> >>>>>>> Motivation
>> >>>>>>>
>> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes). This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>> >>>>>>>
>> >>>>>>> - Inefficient Storage due to map-based structure:
>> >>>>>>>   - Large memory overhead during planning/processing
>> >>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>> >>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>> >>>>>>>   - Lossy type inference during reads
>> >>>>>>>   - Schema evolution challenges (e.g., widening types)
>> >>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Goals
>> >>>>>>>
>> >>>>>>> Improve the column stats representation to allow for the following:
>> >>>>>>>
>> >>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>> >>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>> >>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks
>> >>>>>>> Eduard
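
For readers who want a picture of the difference between the two representations, here is a minimal Python sketch. The first structure mirrors the map-based layout described in the Motivation section (one map per statistic, keyed by field id, with bounds held as serialized bytes). The FieldStats class, the DECODERS table, and the to_field_stats and project helpers are hypothetical names made up for this illustration; they are not the schema from the proposal, and real projection would happen while reading the file rather than in memory as modeled here.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Map-based layout: one map per statistic, keyed by field id.
# Bounds are opaque bytes, so the type must be known from elsewhere.
map_based_stats = {
    "value_counts": {1: 100, 2: 100},
    "null_value_counts": {1: 0, 2: 7},
    "lower_bounds": {1: (1).to_bytes(4, "little", signed=True), 2: b"apple"},
    "upper_bounds": {1: (99).to_bytes(4, "little", signed=True), 2: b"pear"},
}


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical per-field stats struct with typed bounds."""
    value_count: Optional[int] = None
    null_value_count: Optional[int] = None
    lower_bound: Any = None
    upper_bound: Any = None


# Assumed decoders for two types: 4-byte little-endian int and UTF-8 string.
DECODERS = {
    "int": lambda b: int.from_bytes(b, "little", signed=True),
    "string": lambda b: b.decode("utf-8"),
}


def to_field_stats(stats: dict, field_types: dict) -> dict:
    """Regroup the per-statistic maps into one typed struct per field id."""
    result = {}
    for field_id, type_name in field_types.items():
        decode = DECODERS[type_name]
        result[field_id] = FieldStats(
            value_count=stats["value_counts"].get(field_id),
            null_value_count=stats["null_value_counts"].get(field_id),
            lower_bound=decode(stats["lower_bounds"][field_id]),
            upper_bound=decode(stats["upper_bounds"][field_id]),
        )
    return result


def project(per_field: dict, field_id: int, stat_name: str) -> Any:
    """Access a single stat for a single field without touching the others."""
    return getattr(per_field[field_id], stat_name)


per_field = to_field_stats(map_based_stats, {1: "int", 2: "string"})
null_count = project(per_field, 2, "null_value_count")
```

With the per-field structure, a request such as "only null_value_counts for column X" names exactly one nested field, which is what lets a reader skip or avoid loading everything else.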