It seems reasonable to support stats for computed/calculated columns with assigned field ids.
E.g., Flink has "computed columns" for a long time:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#columns

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `price` DOUBLE,
  `quantity` DOUBLE,
  `cost` AS price * quantity  -- evaluate expression and supply the result to queries
) WITH (
  'connector' = 'kafka'
  ...
);

On Tue, Jul 22, 2025 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> As long as enough folks are on board for assigning table field IDs to non-physical columns I have no problem with that approach.
>
> On Tue, Jul 22, 2025 at 12:34 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> > I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>
>> If we include only positive integers for IDs then the max ID is 2^31-1 = 2,147,483,647. The last 200 fields are reserved, and this is proposing to skip the first 10,000 IDs for other structures in the manifest file schema. That can accommodate more than 10.7 million field IDs from the table space (= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper bound for the number of table columns.
>>
>> > I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file.
>>
>> I've also been thinking about being able to store stats for other expressions in this space, including partition values that are part of Amogh's one-file commit and adaptive metadata proposal. Another use case is keeping stats for derived values, like `to_lower(string_col)`. What I'd propose is being able to track expressions that are assigned an ID from the table column space, then storing the expression's stats according to the structure proposed here. But I wouldn't over-complicate the current proposal by adding this just yet. We can talk about extending stats to expressions in another proposal once we have the structure and ID assignment done.
>>
>> On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> I'm also sorry I missed the discussion because I was busy trying to keep a nearly-3-year-old occupied on a plane :)
>>>
>>> I think the proposal is pretty strong, although I have one request. Currently, we have the ability to note the sort order which was applied to a data file, but we have no way of knowing the statistics of that applied transformation for said data file. This has stopped us from doing a number of optimizations (like combining files which are adjacent based on any complex sort ordering). I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file. This would still only be using things that are already codified in the spec, but would make it possible for engines to use those transforms for further predicate pushdown or optimization in file compaction.
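The composite sort-key bounds Russell describes would enable a check along these lines. This is a minimal Python sketch only; the function name and the tuple encoding of the sort key are illustrative and are not defined by the spec or the proposal:

    # Bounds over the full sort key (A, B, C) are ordinary tuples compared
    # lexicographically; every row's sort key falls between the stored min and max.
    def may_contain(sort_key_min, sort_key_max, probe):
        # If the probe key is outside [min, max], no row in the file can equal it.
        return sort_key_min <= probe <= sort_key_max

    # Using the numbers from the example that follows: per-column bounds cannot
    # reject the lookup (2, 2, 4), but the composite sort-key bounds can.
    print(may_contain((1, 7000, 32), (2, 1, 100000), (2, 2, 4)))  # False -> skip the file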
>>>
>>> A quick example:
>>>
>>> Say I am using a hierarchical sort order (A, B, C). I could then store the max and min of this transform, which would be independent of the individual column maxes. Say
>>>
>>> A: Min 1, A: Max 2
>>> B: Min 1, B: Max 100000
>>> C: Min 1, C: Max 100000
>>>
>>> A,B,C Min: (1, 7000, 32)
>>> A,B,C Max: (2, 1, 100000)
>>>
>>> In this case, if I'm looking for a record 2, 2, 4, I can instantly reject the file using the sort order transform, whereas if I was using the individual columns I would have to read the file.
>>>
>>> This is of course also useful if the sort order is using some kind of space-filling curve or other clustering algorithm.
>>>
>>> Thanks for your hard work,
>>> Russ
>>>
>>> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>> Sorry, but I was not able to join the discussion on Tuesday :(, but I listened to the recording.
>>>>
>>>> A few thoughts:
>>>> - I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>>> - If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, as engines that don't know a specific stats_type could just ignore it.
>>>>
>>>> Thanks Eduard and Dan for driving this!
>>>> Peter
>>>>
>>>> On Wed, Jul 16, 2025 at 7:53 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We met yesterday and talked about the column stats proposal. Please find the recording here <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing> and the notes here <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>>>>>
>>>>> Thanks everyone,
>>>>> Eduard
>>>>>
>>>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>>>>>>
>>>>>> Eduard
>>>>>>
>>>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>>>
>>>>>>> On Mon, Jul 7, 2025 at 21:22 Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>> >
>>>>>>> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor Kaszab
>>>>>>> >
>>>>>>> > On Thu, Jun 5, 2025 at 23:41 John Zhuge <jzh...@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> +1 Looking forward to this feature
>>>>>>> >>
>>>>>>> >> John Zhuge
>>>>>>> >>
>>>>>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>> >>>
>>>>>>> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>> >>>
>>>>>>> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map.
>>>>>>> >>> We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>>>>>>> >>>
>>>>>>> >>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
>>>>>>> >>>
>>>>>>> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly even if the bytes still have to be read is going to save time. That said, using a columnar format is still going to be a good idea!
>>>>>>> >>>
>>>>>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> > Together with the change which allows storing metadata in columnar formats
>>>>>>> >>>>
>>>>>>> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>> >>>>
>>>>>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>>>>>>> >>>>>
>>>>>>> >>>>> +1 for the change
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> +1, excited for this one too; we've seen the current metrics maps blow up the memory and hope we can improve that.
>>>>>>> >>>>>>
>>>>>>> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 cell IDs.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks
>>>>>>> >>>>>> Szehon
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Hey everyone,
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Motivation
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
>>>>>>> >>>>>>> This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> - Inefficient Storage due to map-based structure:
>>>>>>> >>>>>>>   - Large memory overhead during planning/processing
>>>>>>> >>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>>>>>>> >>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>>>>>>> >>>>>>>   - Lossy type inference during reads
>>>>>>> >>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>>> >>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Goals
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Improve the column stats representation to allow for the following:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>>>>>> >>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>>>>>>> >>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Thanks,
>>>>>>> >>>>>>> Eduard
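As a side note on the ID-space limit debated upthread (Péter's concern versus Ryan's arithmetic), the reserved-block math works out as follows. This is a minimal Python sketch; the constant and function names are made up for illustration, and only the `10000 + 200 * data_field_id` formula and the reserved ranges come from the discussion above:

    # Field IDs are positive 32-bit ints; the first 10,000 are skipped for other
    # manifest-file structures and the last 200 stay reserved (per Ryan's reply).
    MAX_FIELD_ID = 2**31 - 1           # 2,147,483,647
    SKIPPED_LEADING_IDS = 10_000
    RESERVED_TRAILING_IDS = 200
    IDS_PER_COLUMN = 200               # one 200-ID stats block per table column

    def stats_block_start(data_field_id: int) -> int:
        # Base field ID of the stats block assigned to a given table column.
        return SKIPPED_LEADING_IDS + IDS_PER_COLUMN * data_field_id

    max_columns = (MAX_FIELD_ID - SKIPPED_LEADING_IDS - RESERVED_TRAILING_IDS) // IDS_PER_COLUMN
    print(max_columns)  # 10,737,367 -> the "more than 10.7 million" columns Ryan mentions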