As long as enough folks are on board with assigning table field IDs to non-physical columns, I have no problem with that approach.
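For anyone double-checking the ID-space arithmetic discussed below, here is a small illustrative sketch (the class, constant, and method names are mine, not from the spec):

    // Proposed scheme: stats field IDs start at 10000 + 200 * data_field_id,
    // with the last 200 IDs reserved. All names here are illustrative.
    public class StatsIdMath {
      static final int BASE = 10_000;       // IDs skipped for manifest schema structures
      static final int STRIDE = 200;        // stats ID slots reserved per table field
      static final int RESERVED_TAIL = 200; // trailing reserved block

      // First stats field ID assigned to a given table column.
      static long statsBaseId(long dataFieldId) {
        return BASE + STRIDE * dataFieldId;
      }

      public static void main(String[] args) {
        System.out.println(statsBaseId(1)); // 10200: first stats ID for table field 1
        long maxColumns = ((long) Integer.MAX_VALUE - BASE - RESERVED_TAIL) / STRIDE;
        System.out.println(maxColumns);     // 10737367 -> "more than 10.7 million"
      }
    }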
On Tue, Jul 22, 2025 at 12:34 PM Ryan Blue <rdb...@gmail.com> wrote:

> > I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>
> If we include only positive integers for IDs, then the max ID is 2^31-1 = 2,147,483,647. The last 200 fields are reserved, and this is proposing to skip the first 10,000 IDs for other structures in the manifest file schema. That can accommodate more than 10.7 million field IDs from the table space (= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper bound for the number of table columns.
>
> > I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file.
>
> I've also been thinking about being able to store stats for other expressions in this space, including partition values that are part of Amogh's one-file commit and adaptive metadata proposal. Another use case is keeping stats for derived values, like `to_lower(string_col)`. What I'd propose is being able to track expressions that are assigned an ID from the table column space, then storing the expression's stats according to the structure proposed here. But I wouldn't over-complicate the current proposal by adding this just yet. We can talk about extending stats to expressions in another proposal once we have the structure and ID assignment done.
>
> On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm also sorry I missed the discussion because I was busy trying to keep a nearly-3-year-old occupied on a plane :)
>>
>> I think the proposal is pretty strong, although I have one request. Currently, we have the ability to note the sort order which was applied to a data file, but we have no way of knowing the statistics of that applied transformation for said data file. This has stopped us from doing a number of optimizations (like combining files which are adjacent based on any complex sort ordering). I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file. This would still only be using things that are already codified in the spec, but would make it possible for engines to use those transforms for further predicate pushdown or optimization in file compaction.
>>
>> A quick example:
>>
>> Say I am using a hierarchical sort order (A, B, C). I could then store the min and max of this transform, which would be independent of the individual column mins and maxes. Say
>>
>> A: Min 1, A: Max 2
>> B: Min 1, B: Max 100000
>> C: Min 1, C: Max 100000
>>
>> A,B,C Min: (1, 7000, 32)
>> A,B,C Max: (2, 1, 100000)
>>
>> In this case, if I'm looking for a record (2, 2, 4), I can instantly reject the file using the sort order transform, whereas if I were using the individual columns I would have to read the file.
>>
>> This is of course also useful if the sort order is using some kind of space-filling curve or other clustering algorithm.
>>
>> Thanks for your hard work,
>> Russ
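To make Russell's rejection example concrete, here is a minimal sketch of the lexicographic check an engine could run against sort-key min/max tuples (class and method names are hypothetical, not from the proposal):

    // Compare same-length tuples lexicographically, the way rows are ordered
    // under a hierarchical sort order (A, B, C).
    public class SortKeyPruning {
      static int compareLex(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
          int cmp = Integer.compare(a[i], b[i]);
          if (cmp != 0) {
            return cmp;
          }
        }
        return 0;
      }

      // A file can contain the point only if it falls between the file's
      // sort-key min and max tuples.
      static boolean mightContain(int[] point, int[] keyMin, int[] keyMax) {
        return compareLex(point, keyMin) >= 0 && compareLex(point, keyMax) <= 0;
      }

      public static void main(String[] args) {
        int[] keyMin = {1, 7000, 32};
        int[] keyMax = {2, 1, 100000};
        // (2, 2, 4) sorts after the max key (2, 1, 100000), so the file is
        // skipped, even though every per-column [min, max] range contains it.
        System.out.println(mightContain(new int[] {2, 2, 4}, keyMin, keyMax)); // false
      }
    }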
>> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Team,
>>> Sorry, but I was not able to join the discussion on Tuesday :(, but I listened to the recording.
>>>
>>> A few thoughts:
>>> - I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>> - If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, as engines which don't know a specific stats_type could just ignore it.
>>>
>>> Thanks Eduard and Dan for driving this!
>>> Peter
>>>
>>> On Wed, Jul 16, 2025 at 7:53 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> We met yesterday and talked about the column stats proposal. Please find the recording here <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing> and the notes here <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>>>>
>>>> Thanks everyone,
>>>> Eduard
>>>>
>>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>>>>>
>>>>> Eduard
>>>>>
>>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>>
>>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>>
>>>>>> On Mon, Jul 7, 2025 at 9:22 PM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>
>>>>>>> +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gabor Kaszab
>>>>>>>
>>>>>>> On Thu, Jun 5, 2025 at 11:41 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 Looking forward to this feature
>>>>>>>>
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > I think it does not make sense to stick with Avro for manifest files if we break column stats into sub-fields.
>>>>>>>>>
>>>>>>>>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well, by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>>>>>>>>>
>>>>>>>>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
>>>>>>>>>
>>>>>>>>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly, even if the bytes still have to be read, is going to save time. That said, using a columnar format is still going to be a good idea!
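A minimal sketch of the two skipping cases Ryan describes, assuming a simplified length-prefixed encoding rather than Avro's actual varint-based one (class and method names are illustrative):

    import java.io.DataInputStream;
    import java.io.IOException;

    public class BoundSkipper {
      // Skip one lower/upper bound without materializing it.
      // fixedWidth < 0 marks a variable-length type (string/binary).
      static void skipBound(DataInputStream in, int fixedWidth) throws IOException {
        if (fixedWidth >= 0) {
          in.skipBytes(fixedWidth); // fixed-width type: just advance the position
        } else {
          int len = in.readInt();   // variable-length type: read the length...
          in.skipBytes(len);        // ...then skip that many bytes
        }
        // Note: a real reader would loop, since skipBytes may skip fewer bytes.
      }
    }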
>>>>>>>>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> > Together with the change which allows storing metadata in columnar formats
>>>>>>>>>>
>>>>>>>>>> +1 on this. I think it does not make sense to stick with Avro for manifest files if we break column stats into sub-fields.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>>>>>>>>>>>
>>>>>>>>>>> +1 for the change
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, excited for this one too. We've seen the current metrics maps blow up the memory and hope we can improve that.
>>>>>>>>>>>>
>>>>>>>>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Motivation
>>>>>>>>>>>>>
>>>>>>>>>>>>> Column statistics are currently stored as mappings of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes). This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Inefficient Storage due to map-based structure:
>>>>>>>>>>>>>   - Large memory overhead during planning/processing
>>>>>>>>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>>>>>>>>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>>>>>>>>>>>>>   - Lossy type inference during reads
>>>>>>>>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>>>>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Goals
>>>>>>>>>>>>>
>>>>>>>>>>>>> Improve the column stats representation to allow for the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>>>>>>>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>>>>>>>>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Eduard
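As a rough illustration of the layout change the Motivation and Goals above point at (field and type names here are hypothetical; the actual schema is in the linked proposal doc):

    import java.nio.ByteBuffer;
    import java.util.Map;

    // Today: parallel maps keyed by field ID, with bounds type-erased to binary.
    // Reading one stat for one column still drags in whole maps.
    record DataFileStatsToday(
        Map<Integer, Long> valueCounts,
        Map<Integer, Long> nullValueCounts,
        Map<Integer, ByteBuffer> lowerBounds, // original type is lost here
        Map<Integer, ByteBuffer> upperBounds) {}

    // Proposed direction: one typed struct per column, so readers can project
    // a single stat (e.g. nullValueCount) and bounds keep their real type,
    // which also leaves room for new per-field stats later.
    record ColumnStats<T>(
        long valueCount,
        long nullValueCount,
        T lowerBound,
        T upperBound) {}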