It seems reasonable to support stats for computed/calculated columns with assigned field ids.
E.g., Flink has "computed columns" for a long time:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#columns

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `price` DOUBLE,
  `quantity` DOUBLE,
  `cost` AS price * quantity  -- evaluate expression and supply the result to queries
) WITH (
  'connector' = 'kafka'
  ...
);

On Tue, Jul 22, 2025 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> As long as enough folks are on board for assigning table field IDs to non-physical columns I have no problem with that approach.
>
> On Tue, Jul 22, 2025 at 12:34 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> > I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>
>> If we include only positive integers for IDs then the max ID is 2^31-1 = 2,147,483,647. The last 200 fields are reserved, and this is proposing to skip the first 10,000 IDs for other structures in the manifest file schema. That can accommodate more than 10.7 million field IDs from the table space (= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper bound for the number of table columns.
>>
>> > I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file.
>>
>> I've also been thinking about being able to store stats for other expressions in this space, including partition values that are part of Amogh's one-file commit and adaptive metadata proposal. Another use case is keeping stats for derived values, like `to_lower(string_col)`. What I'd propose is being able to track expressions that are assigned an ID from the table column space, then storing the expression's stats according to the structure proposed here. But I wouldn't over-complicate the current proposal by adding this just yet. We can talk about extending stats to expressions in another proposal once we have the structure and ID assignment done.
>>
>> On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> I'm also sorry I missed the discussion because I was busy trying to keep a nearly-3-year-old occupied on a plane :)
>>>
>>> I think the proposal is pretty strong, although I have one request. Currently, we have the ability to note the sort order which was applied to a data file, but we have no way of knowing the statistics of that applied transformation for said data file. This has stopped us from doing a number of optimizations (like combining files which are adjacent based on any complex sort ordering). I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file. This would still only be using things that are already codified in the spec, but would make it possible for engines to use those transforms for further predicate pushdown or optimization in file compaction.
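The composite sort-key bounds Russell describes would enable a check along these lines. This is a minimal Python sketch only; the function name and the tuple encoding of the sort key are illustrative and are not defined by the spec or the proposal:

    # Bounds over the full sort key (A, B, C) are ordinary tuples compared
    # lexicographically; every row's sort key falls between the stored min and max.
    def may_contain(sort_key_min, sort_key_max, probe):
        # If the probe key is outside [min, max], no row in the file can equal it.
        return sort_key_min <= probe <= sort_key_max

    # Using the numbers from the example that follows: per-column bounds cannot
    # reject the lookup (2, 2, 4), but the composite sort-key bounds can.
    print(may_contain((1, 7000, 32), (2, 1, 100000), (2, 2, 4)))  # False -> skip the file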
>>>
>>> A quick example:
>>>
>>> Say I am using a hierarchical sort order (A, B, C). I could then store the max and min of this transform, which would be independent of the individual column maxes. Say
>>>
>>> A: Min 1, A: Max 2
>>> B: Min 1, B: Max 100000
>>> C: Min 1, C: Max 100000
>>>
>>> A,B,C Min: (1, 7000, 32)
>>> A,B,C Max: (2, 1, 100000)
>>>
>>> In this case, if I'm looking for a record 2, 2, 4, I can instantly reject the file using the sort order transform, whereas if I was using the individual columns I would have to read the file.
>>>
>>> This is of course also useful if the sort order is using some kind of space-filling curve or other clustering algorithm.
>>>
>>> Thanks for your hard work,
>>> Russ
>>>
>>> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>> Sorry, but I was not able to join the discussion on Tuesday :(, but I listened to the recording.
>>>>
>>>> A few thoughts:
>>>> - I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>>> - If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, as engines that don't know a specific stats_type could just ignore it.
>>>>
>>>> Thanks Eduard and Dan for driving this!
>>>> Peter
>>>>
>>>> On Wed, Jul 16, 2025 at 7:53 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We met yesterday and talked about the column stats proposal. Please find the recording here <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing> and the notes here <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>>>>>
>>>>> Thanks everyone,
>>>>> Eduard
>>>>>
>>>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>>>>>>
>>>>>> Eduard
>>>>>>
>>>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>>>
>>>>>>> On Mon, Jul 7, 2025 at 21:22 Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>> >
>>>>>>> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor Kaszab
>>>>>>> >
>>>>>>> > On Thu, Jun 5, 2025 at 23:41 John Zhuge <jzh...@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> +1 Looking forward to this feature
>>>>>>> >>
>>>>>>> >> John Zhuge
>>>>>>> >>
>>>>>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>> >>>
>>>>>>> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>> >>>
>>>>>>> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map.
>>>>>>> >>> We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>>>>>>> >>>
>>>>>>> >>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
>>>>>>> >>>
>>>>>>> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly even if the bytes still have to be read is going to save time. That said, using a columnar format is still going to be a good idea!
>>>>>>> >>>
>>>>>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> > Together with the change which allows storing metadata in columnar formats
>>>>>>> >>>>
>>>>>>> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>> >>>>
>>>>>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>>>>>>> >>>>>
>>>>>>> >>>>> +1 for the change
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> +1, excited for this one too; we've seen the current metrics maps blow up the memory and hope we can improve that.
>>>>>>> >>>>>>
>>>>>>> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 cell IDs.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks
>>>>>>> >>>>>> Szehon
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Hey everyone,
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Motivation
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
>>>>>>> >>>>>>> This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> - Inefficient Storage due to map-based structure:
>>>>>>> >>>>>>>   - Large memory overhead during planning/processing
>>>>>>> >>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>>>>>>> >>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>>>>>>> >>>>>>>   - Lossy type inference during reads
>>>>>>> >>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>>> >>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Goals
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Improve the column stats representation to allow for the following:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>>>>>> >>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>>>>>>> >>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Thanks,
>>>>>>> >>>>>>> Eduard
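As a side note on the ID-space limit debated upthread (Péter's concern versus Ryan's arithmetic), the reserved-block math works out as follows. This is a minimal Python sketch; the constant and function names are made up for illustration, and only the `10000 + 200 * data_field_id` formula and the reserved ranges come from the discussion above:

    # Field IDs are positive 32-bit ints; the first 10,000 are skipped for other
    # manifest-file structures and the last 200 stay reserved (per Ryan's reply).
    MAX_FIELD_ID = 2**31 - 1           # 2,147,483,647
    SKIPPED_LEADING_IDS = 10_000
    RESERVED_TRAILING_IDS = 200
    IDS_PER_COLUMN = 200               # one 200-ID stats block per table column

    def stats_block_start(data_field_id: int) -> int:
        # Base field ID of the stats block assigned to a given table column.
        return SKIPPED_LEADING_IDS + IDS_PER_COLUMN * data_field_id

    max_columns = (MAX_FIELD_ID - SKIPPED_LEADING_IDS - RESERVED_TRAILING_IDS) // IDS_PER_COLUMN
    print(max_columns)  # 10,737,367 -> the "more than 10.7 million" columns Ryan mentions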