Hi Team,

Sorry I was not able to join the discussion on Tuesday :(, but I listened to the recording.
A few thoughts:

- I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space, and I don't think we would like to create hard limits here.
- If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, since engines that don't know a specific stats_type could just ignore it.

Thanks Eduard and Dan for driving this!
Peter

Eduard Tudenhöfner <etudenhoef...@apache.org> wrote (on Wed, Jul 16, 2025, 7:53):

> Hey everyone,
>
> We met yesterday and talked about the column stats proposal.
> Please find the recording here
> <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing>
> and the notes here
> <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>
> Thanks everyone,
> Eduard
>
> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>
>> Hey everyone,
>>
>> I've just added an event to the dev calendar for July 15 at 9am (PT) to
>> discuss the column stats proposal.
>>
>> Eduard
>>
>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>
>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>
>>> Gábor Kaszab <gaborkas...@apache.org> wrote (on Mon, Jul 7, 2025, 21:22):
>>> >
>>> > +1 Seems a great improvement! Let me know if I can help out with
>>> > implementation, measurements, etc.!
>>> >
>>> > Regards,
>>> > Gabor Kaszab
>>> >
>>> > John Zhuge <jzh...@apache.org> wrote (on 2025 Jun
5, Thu, 23:41):
>>> >>
>>> >> +1 Looking forward to this feature
>>> >>
>>> >> John Zhuge
>>> >>
>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>> >>>
>>> >>> > I think it does not make sense to stick manifest files to Avro if
>>> >>> > we break column stats into sub fields.
>>> >>>
>>> >>> This isn't necessarily true. Avro can benefit from better pushdown
>>> >>> with Eduard's approach as well by being able to skip more efficiently. With
>>> >>> the current layout, Avro stores a list of key/value pairs that are all
>>> >>> projected and put into a map. We avoid decoding the values, but each field
>>> >>> ID is decoded, then the length of the value is decoded, and finally there
>>> >>> is a put operation with an ID and value ByteBuffer pair. With the new
>>> >>> approach, we will be able to know which fields are relevant and skip
>>> >>> unprojected fields based on the file schema, which we couldn't do before.
>>> >>>
>>> >>> To skip stats for an unused field (not part of the filter), there
>>> >>> are two cases. Lower/upper bounds for types that are fixed width are
>>> >>> skipped by updating the read position. And bounds for types that are
>>> >>> variable length (strings and binary) are skipped by reading the length and
>>> >>> skipping that number of bytes.
>>> >>>
>>> >>> It turns out that actually producing the metric maps is a fairly
>>> >>> expensive operation, so being able to skip metrics more quickly even if the
>>> >>> bytes still have to be read is going to save time. That said, using a
>>> >>> columnar format is still going to be a good idea!
>>> >>>
>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>> >>>>
>>> >>>> > Together with the change which allows storing metadata in
>>> >>>> > columnar formats
>>> >>>>
>>> >>>> +1 on this. I think it does not make sense to stick manifest files
>>> >>>> to Avro if we break column stats into sub fields.
>>> >>>>
>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> I would love to see more flexibility in file stats. Together with
>>> >>>>> the change which allows storing metadata in columnar formats, this will open up
>>> >>>>> many new possibilities: Bloom filters in metadata which could be used for
>>> >>>>> filtering out files, HLL sketches, etc.
>>> >>>>>
>>> >>>>> +1 for the change
>>> >>>>>
>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> +1, excited for this one too. We've seen the current metrics
>>> >>>>>> maps blow up the memory and hope we can improve that.
>>> >>>>>>
>>> >>>>>> On the Geo front, this could allow us to add supplementary
>>> >>>>>> metrics that don't conform to the geo type, like S2 Cell Ids.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Szehon
>>> >>>>>>
>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>> >>>>>>>
>>> >>>>>>> Hey everyone,
>>> >>>>>>>
>>> >>>>>>> I'm starting a thread to connect folks interested in improving
>>> >>>>>>> the existing way of collecting column-level statistics (often referred to
>>> >>>>>>> as metrics in the code). I've already started a proposal, which can be
>>> >>>>>>> found at https://s.apache.org/iceberg-column-stats.
>>> >>>>>>>
>>> >>>>>>> Motivation
>>> >>>>>>>
>>> >>>>>>> Column statistics are currently stored as a mapping of field id
>>> >>>>>>> to values across multiple columns (lower/upper bounds, value/nan/null
>>> >>>>>>> counts, sizes).
>>> >>>>>>> This storage model has critical limitations as the number
>>> >>>>>>> of columns increases and as new types are added to Iceberg:
>>> >>>>>>>
>>> >>>>>>> Inefficient storage due to the map-based structure:
>>> >>>>>>> - Large memory overhead during planning/processing
>>> >>>>>>> - Inability to project specific stats (e.g., only null_value_counts for column X)
>>> >>>>>>>
>>> >>>>>>> Type erasure: original logical/physical types are lost when stats are stored as binary blobs, causing:
>>> >>>>>>> - Lossy type inference during reads
>>> >>>>>>> - Schema evolution challenges (e.g., widening types)
>>> >>>>>>>
>>> >>>>>>> Rigid schema: stats are tied to the data_file entry record, limiting extensibility for new stats.
>>> >>>>>>>
>>> >>>>>>> Goals
>>> >>>>>>>
>>> >>>>>>> Improve the column stats representation to allow for the following:
>>> >>>>>>>
>>> >>>>>>> - Projectability: enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>> >>>>>>> - Type preservation: store original data types to support accurate reads and schema evolution.
>>> >>>>>>> - Flexible/extensible representation: allow per-field stats structures (e.g., complex types like Geo/Variant).
>>> >>>>>>>
>>> >>>>>>> Thanks
>>> >>>>>>> Eduard
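P.S. To make the id-space arithmetic in my first point concrete, here is a rough sketch. Only the `10000 + 200 * data_field_id` formula comes from the current proposal; the helper names and the dense-assignment scheme are hypothetical illustrations of the metadata-based alternative (starting_id, stats_types_present, columns_with_stats), not an actual implementation:

```python
# Hypothetical sketch; only the 10000 + 200 * data_field_id formula is from
# the proposal under discussion, everything else is illustrative.

STATS_BASE = 10_000
IDS_PER_COLUMN = 200

def current_stat_id(data_field_id: int, stat_offset: int) -> int:
    """Current proposal: every data column reserves a fixed block of 200 ids,
    whether or not it actually carries stats."""
    return STATS_BASE + IDS_PER_COLUMN * data_field_id + stat_offset

def dense_stat_id(starting_id: int, columns_with_stats: list[int],
                  stats_types_present: list[str],
                  data_field_id: int, stat_type: str) -> int:
    """Alternative sketch: metadata records starting_id, which columns actually
    have stats, and which stat types are present, so ids are assigned densely
    and engines can ignore stat types they don't recognize."""
    col_index = columns_with_stats.index(data_field_id)
    stat_index = stats_types_present.index(stat_type)
    return starting_id + col_index * len(stats_types_present) + stat_index

# With 10,000 columns the fixed scheme already reserves ids up to ~2 million:
print(current_stat_id(10_000, 0))  # 2010000

# Dense scheme: 3 columns with stats x 2 stat types -> only 6 ids used.
print(dense_stat_id(10_000, [1, 5, 42], ["lower_bound", "null_count"],
                    42, "null_count"))  # 10005
```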