I also like the assigned field ids; my only concern is the way they are generated. I have two main issues:

1. The current proposal leaves only 10,000 ids (plus the 200 reserved ones) for manifest columns other than stats. If in the future we find some other feature that requires a manifest file column for every data column in the table, then we would need to change the spec.
2. The current proposal expects every engine to share the same stats, and not store any "non-standard" stat in the metadata.

I think the 1st could be solved by simply reserving a bigger number of fieldIds, whatever algorithm we choose. I understand that the 2nd is not an immediate goal at this point, but I don't see how we can support it in the future if we use only ids for identifying the stat fields.
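To make the 1st issue concrete, here is the id arithmetic as I read the current proposal. This is only a sketch: the constants come from the `10000 + 200 * data_field_id` scheme discussed below, and the class and method names are mine, not part of the spec.

// Sketch of the proposed stats field-id assignment (names illustrative).
public class StatsFieldIds {
  // ids below 10,000 remain available for other manifest-file structures
  static final int STATS_BASE_ID = 10_000;
  // each data column gets a block of 200 ids for its stats fields
  static final int IDS_PER_COLUMN = 200;

  static int statsBlockStart(int dataFieldId) {
    return STATS_BASE_ID + IDS_PER_COLUMN * dataFieldId;
  }

  public static void main(String[] args) {
    // a 10k-column table claims stats ids up to 10,000 + 200 * 10,000,
    // while every non-stats manifest column must fit below 10,000
    System.out.println(statsBlockStart(10_000)); // 2010000
  }
}

So the stats blocks themselves scale fine; it is the fixed 10,000-id budget for everything else that worries me.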
On Tue, Jul 22, 2025 at 8:21 PM Steven Wu <stevenz...@gmail.com> wrote:

> It seems reasonable to support stats for computed/calculated columns with assigned field ids.
>
> E.g., Flink has had "computed columns" for a long time:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#columns
>
> CREATE TABLE MyTable (
>   `user_id` BIGINT,
>   `price` DOUBLE,
>   `quantity` DOUBLE,
>   `cost` AS price * quantity  -- evaluate expression and supply the result to queries
> ) WITH (
>   'connector' = 'kafka'
>   ...
> );
>
> On Tue, Jul 22, 2025 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> As long as enough folks are on board with assigning table field-ids to non-physical columns, I have no problem with that approach.
>>
>> On Tue, Jul 22, 2025 at 12:34 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> > I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>>
>>> If we include only positive integers for IDs, then the max ID is 2^31-1 = 2,147,483,647. The last 200 fields are reserved, and this is proposing to skip the first 10,000 IDs for other structures in the manifest file schema. That can accommodate more than 10.7 million field IDs from the table space (= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper bound for the number of table columns.
>>>
>>> > I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file.
>>>
>>> I've also been thinking about being able to store stats for other expressions in this space, including partition values that are part of Amogh's one-file commit and adaptive metadata proposal. Another use case is keeping stats for derived values, like `to_lower(string_col)`. What I'd propose is being able to track expressions that are assigned an ID from the table column space, then storing the expression's stats according to the structure proposed here. But I wouldn't over-complicate the current proposal by adding this just yet. We can talk about extending stats to expressions in another proposal once we have the structure and ID assignment done.
>>>
>>> On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I'm also sorry I missed the discussion; I was busy trying to keep a nearly-3-year-old occupied on a plane :)
>>>>
>>>> I think the proposal is pretty strong, although I have one request. Currently, we have the ability to note the sort order that was applied to a data file, but we have no way of knowing the statistics of that applied transformation for said data file. This has stopped us from doing a number of optimizations (like combining files which are adjacent based on any complex sort ordering). I think this would be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file. This would still only be using things that are already codified in the spec, but it would make it possible for engines to use those transforms for further predicate pushdown or for optimization in file compaction.
>>>>
>>>> A quick example:
>>>>
>>>> Say I am using a hierarchical sort order (A, B, C). I could then store the min and max of this transform, which would be independent of the individual column maxes. Say:
>>>>
>>>> A: Min 1, Max 2
>>>> B: Min 1, Max 100000
>>>> C: Min 1, Max 100000
>>>>
>>>> (A, B, C) Min: (1, 7000, 32)
>>>> (A, B, C) Max: (2, 1, 100000)
>>>>
>>>> In this case, if I'm looking for a record (2, 2, 4), I can instantly reject the file using the sort order transform, whereas if I was using the individual columns I would have to read the file.
>>>>
>>>> This is of course also useful if the sort order is using some kind of space-filling curve or other clustering algorithm.
>>>>
>>>> Thanks for your hard work,
>>>> Russ
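(Replying inline to Russell's example above: the rejection test is just a lexicographic comparison of the probe tuple against the stored tuple bounds. A minimal sketch with the columns simplified to ints; all names here are mine, nothing from the proposal:)

class SortOrderBounds {
  // a file can be skipped when the probe tuple falls outside [min, max]
  // under lexicographic ordering
  static boolean mightContain(int[] probe, int[] min, int[] max) {
    return lexCompare(probe, min) >= 0 && lexCompare(probe, max) <= 0;
  }

  static int lexCompare(int[] a, int[] b) {
    for (int i = 0; i < a.length; i++) {
      int cmp = Integer.compare(a[i], b[i]);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    // Russell's numbers: the per-column ranges alone cannot reject (2, 2, 4),
    // but the tuple bounds can, since (2, 2, 4) > (2, 1, 100000) lexicographically
    int[] min = {1, 7000, 32};
    int[] max = {2, 1, 100000};
    System.out.println(mightContain(new int[] {2, 2, 4}, min, max)); // false
  }
}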
>>>> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> Sorry, I was not able to join the discussion on Tuesday :(, but I listened to the recording.
>>>>>
>>>>> A few thoughts:
>>>>> - I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>>>> - If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, as engines that don't know a specific stats_type could just ignore it.
>>>>>
>>>>> Thanks Eduard and Dan for driving this!
>>>>> Peter
>>>>>
>>>>> On Wed, Jul 16, 2025 at 7:53 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> We met yesterday and talked about the column stats proposal. Please find the recording here
>>>>>> <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing>
>>>>>> and the notes here
>>>>>> <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>>>>>>
>>>>>> Thanks everyone,
>>>>>> Eduard
>>>>>>
>>>>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>>>>>>>
>>>>>>> Eduard
>>>>>>>
>>>>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>>>>
>>>>>>>> On Mon, Jul 7, 2025 at 9:22 PM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>>> >
>>>>>>>> > +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Gabor Kaszab
>>>>>>>> >
>>>>>>>> > On Thu, Jun 5, 2025 at 11:41 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>> >>
>>>>>>>> >> +1 Looking forward to this feature
>>>>>>>> >>
>>>>>>>> >> John Zhuge
>>>>>>>> >>
>>>>>>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> > I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>>> >>>
>>>>>>>> >>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well, by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>>>>>>>> >>>
>>>>>>>> >>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
>>>>>>>> >>>
>>>>>>>> >>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly, even if the bytes still have to be read, is going to save time. That said, using a columnar format is still going to be a good idea!
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> > Together with the change which allows storing metadata in columnar formats
>>>>>>>> >>>>
>>>>>>>> >>>> +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields.
>>>>>>>> >>>>
>>>>>>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>>>>>>>> >>>>>
>>>>>>>> >>>>> +1 for the change
>>>>>>>> >>>>>
>>>>>>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> +1, excited for this one too; we've seen the current metrics maps blow up memory and hope this can improve that.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thanks
>>>>>>>> >>>>>> Szehon
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Hey everyone,
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code).
>>>>>>>> >>>>>>> I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Motivation
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes). This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> - Inefficient Storage due to the map-based structure:
>>>>>>>> >>>>>>>   - Large memory overhead during planning/processing
>>>>>>>> >>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>>>>>>>> >>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>>>>>>>> >>>>>>>   - Lossy type inference during reads
>>>>>>>> >>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>>>> >>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Goals
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Improve the column stats representation to allow for the following:
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>>>>>>> >>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>>>>>>>> >>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Thanks
>>>>>>>> >>>>>>> Eduard
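P.S. Coming back to my 2nd issue: Péter's metadata-key idea further up the thread could make the stats layout self-describing, so engines can skip stat types they don't recognize. A rough sketch of what I mean; all names are illustrative and none of this is in the current proposal:

import java.util.List;

// A self-describing stats layout instead of deriving ids purely from
// "10000 + 200 * data_field_id". Engines that don't recognize an entry in
// statsTypesPresent would simply skip the corresponding fields.
record StatsLayout(
    int startingId,                  // first id used by stats fields
    List<String> statsTypesPresent,  // e.g. "lower_bound", "null_count", or an engine-specific type
    List<Integer> columnsWithStats   // data field ids that actually carry stats
) {
  // a stat's field id becomes a function of its position in the layout, so
  // the reserved id space only needs to cover what is actually stored
  int fieldIdFor(int columnIndex, int statsTypeIndex) {
    return startingId + columnIndex * statsTypesPresent.size() + statsTypeIndex;
  }
}

Not something for this proposal necessarily, just sketching why ids alone feel limiting to me.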