As long as enough folks are on board with assigning table field IDs to non-physical columns, I have no problem with that approach.
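For anyone double-checking the ID-space arithmetic discussed below, here is a small illustrative sketch (the class, constant, and method names are mine, not from the spec):

    // Proposed scheme: stats field IDs start at 10000 + 200 * data_field_id,
    // with the last 200 IDs reserved. All names here are illustrative.
    public class StatsIdMath {
      static final int BASE = 10_000;       // IDs skipped for manifest schema structures
      static final int STRIDE = 200;        // stats ID slots reserved per table field
      static final int RESERVED_TAIL = 200; // trailing reserved block

      // First stats field ID assigned to a given table column.
      static long statsBaseId(long dataFieldId) {
        return BASE + STRIDE * dataFieldId;
      }

      public static void main(String[] args) {
        System.out.println(statsBaseId(1)); // 10200: first stats ID for table field 1
        long maxColumns = ((long) Integer.MAX_VALUE - BASE - RESERVED_TAIL) / STRIDE;
        System.out.println(maxColumns);     // 10737367 -> "more than 10.7 million"
      }
    }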
On Tue, Jul 22, 2025 at 12:34 PM Ryan Blue <rdb...@gmail.com> wrote:

> > I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>
> If we include only positive integers for IDs, then the max ID is 2^31-1 = 2,147,483,647. The last 200 fields are reserved, and this is proposing to skip the first 10,000 IDs for other structures in the manifest file schema. That can accommodate more than 10.7 million field IDs from the table space (= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper bound for the number of table columns.
>
> > I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file.
>
> I've also been thinking about being able to store stats for other expressions in this space, including partition values that are part of Amogh's one-file commit and adaptive metadata proposal. Another use case is keeping stats for derived values, like `to_lower(string_col)`. What I'd propose is being able to track expressions that are assigned an ID from the table column space, then storing the expression's stats according to the structure proposed here. But I wouldn't over-complicate the current proposal by adding this just yet. We can talk about extending stats to expressions in another proposal once we have the structure and ID assignment done.
>
> On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm also sorry I missed the discussion because I was busy trying to keep a nearly-3-year-old occupied on a plane :)
>>
>> I think the proposal is pretty strong, although I have one request. Currently, we have the ability to note the sort order which was applied to a data file, but we have no way of knowing the statistics of that applied transformation for said data file. This has stopped us from doing a number of optimizations (like combining files which are adjacent based on any complex sort ordering). I think this would probably be a good opportunity to also reserve space for metrics which apply to the Sort Order transformation used to create a file. This would still only be using things that are already codified in the spec, but would make it possible for engines to use those transforms for further predicate pushdown or optimization in file compaction.
>>
>> A quick example:
>>
>> Say I am using a hierarchical sort order (A, B, C). I could then store the min and max of this transform, which would be independent of the individual column mins and maxes. Say
>>
>> A: Min 1, A: Max 2
>> B: Min 1, B: Max 100000
>> C: Min 1, C: Max 100000
>>
>> A,B,C Min: (1, 7000, 32)
>> A,B,C Max: (2, 1, 100000)
>>
>> In this case, if I'm looking for a record (2, 2, 4), I can instantly reject the file using the sort order transform, whereas if I were using the individual columns I would have to read the file.
>>
>> This is of course also useful if the sort order is using some kind of space-filling curve or other clustering algorithm.
>>
>> Thanks for your hard work,
>> Russ
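To make Russell's rejection example concrete, here is a minimal sketch of the lexicographic check an engine could run against sort-key min/max tuples (class and method names are hypothetical, not from the proposal):

    // Compare same-length tuples lexicographically, the way rows are ordered
    // under a hierarchical sort order (A, B, C).
    public class SortKeyPruning {
      static int compareLex(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
          int cmp = Integer.compare(a[i], b[i]);
          if (cmp != 0) {
            return cmp;
          }
        }
        return 0;
      }

      // A file can contain the point only if it falls between the file's
      // sort-key min and max tuples.
      static boolean mightContain(int[] point, int[] keyMin, int[] keyMax) {
        return compareLex(point, keyMin) >= 0 && compareLex(point, keyMax) <= 0;
      }

      public static void main(String[] args) {
        int[] keyMin = {1, 7000, 32};
        int[] keyMax = {2, 1, 100000};
        // (2, 2, 4) sorts after the max key (2, 1, 100000), so the file is
        // skipped, even though every per-column [min, max] range contains it.
        System.out.println(mightContain(new int[] {2, 2, 4}, keyMin, keyMax)); // false
      }
    }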
>> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Team,
>>> Sorry, but I was not able to join the discussion on Tuesday :(, but I listened to the recording.
>>>
>>> A few thoughts:
>>> - I still think that the current `10000 + 200 * data_field_id` id generation logic is very limiting. I have already seen tables with more than 10k columns in the AI space. And I don't think we would like to create hard limits here.
>>> - If we instead store a few things in metadata (starting_id, stats_types_present, columns_with_stats), then we can greatly reduce the used id space and also allow for extensibility, as engines which don't know a specific stats_type could just ignore it.
>>>
>>> Thanks Eduard and Dan for driving this!
>>> Peter
>>>
>>> On Wed, Jul 16, 2025 at 7:53 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> We met yesterday and talked about the column stats proposal. Please find the recording here <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing> and the notes here <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>.
>>>>
>>>> Thanks everyone,
>>>> Eduard
>>>>
>>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to discuss the column stats proposal.
>>>>>
>>>>> Eduard
>>>>>
>>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>>
>>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>>
>>>>>> On Mon, Jul 7, 2025 at 9:22 PM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>
>>>>>>> +1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gabor Kaszab
>>>>>>>
>>>>>>> On Thu, Jun 5, 2025 at 11:41 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 Looking forward to this feature
>>>>>>>>
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > I think it does not make sense to stick with Avro for manifest files if we break column stats into sub-fields.
>>>>>>>>>
>>>>>>>>> This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well, by being able to skip more efficiently. With the current layout, Avro stores a list of key/value pairs that are all projected and put into a map. We avoid decoding the values, but each field ID is decoded, then the length of the value is decoded, and finally there is a put operation with an ID and value ByteBuffer pair. With the new approach, we will be able to know which fields are relevant and skip unprojected fields based on the file schema, which we couldn't do before.
>>>>>>>>>
>>>>>>>>> To skip stats for an unused field (not part of the filter), there are two cases. Lower/upper bounds for types that are fixed width are skipped by updating the read position. And bounds for types that are variable length (strings and binary) are skipped by reading the length and skipping that number of bytes.
>>>>>>>>>
>>>>>>>>> It turns out that actually producing the metric maps is a fairly expensive operation, so being able to skip metrics more quickly, even if the bytes still have to be read, is going to save time. That said, using a columnar format is still going to be a good idea!
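A minimal sketch of the two skipping cases Ryan describes, assuming a simplified length-prefixed encoding rather than Avro's actual varint-based one (class and method names are illustrative):

    import java.io.DataInputStream;
    import java.io.IOException;

    public class BoundSkipper {
      // Skip one lower/upper bound without materializing it.
      // fixedWidth < 0 marks a variable-length type (string/binary).
      static void skipBound(DataInputStream in, int fixedWidth) throws IOException {
        if (fixedWidth >= 0) {
          in.skipBytes(fixedWidth); // fixed-width type: just advance the position
        } else {
          int len = in.readInt();   // variable-length type: read the length...
          in.skipBytes(len);        // ...then skip that many bytes
        }
        // Note: a real reader would loop, since skipBytes may skip fewer bytes.
      }
    }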
>>>>>>>>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> > Together with the change which allows storing metadata in columnar formats
>>>>>>>>>>
>>>>>>>>>> +1 on this. I think it does not make sense to stick with Avro for manifest files if we break column stats into sub-fields.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would love to see more flexibility in file stats. Together with the change which allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata which could be used for filtering out files, HLL sketches, etc.
>>>>>>>>>>>
>>>>>>>>>>> +1 for the change
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, excited for this one too. We've seen the current metrics maps blow up the memory and hope we can improve that.
>>>>>>>>>>>>
>>>>>>>>>>>> On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as metrics in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Motivation
>>>>>>>>>>>>>
>>>>>>>>>>>>> Column statistics are currently stored as mappings of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes). This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Inefficient Storage due to map-based structure:
>>>>>>>>>>>>>   - Large memory overhead during planning/processing
>>>>>>>>>>>>>   - Inability to project specific stats (e.g., only null_value_counts for column X)
>>>>>>>>>>>>> - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing:
>>>>>>>>>>>>>   - Lossy type inference during reads
>>>>>>>>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>>>>>>>>> - Rigid Schema: Stats are tied to the data_file entry record, limiting extensibility for new stats.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Goals
>>>>>>>>>>>>>
>>>>>>>>>>>>> Improve the column stats representation to allow for the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds).
>>>>>>>>>>>>> - Type Preservation: Store original data types to support accurate reads and schema evolution.
>>>>>>>>>>>>> - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Eduard
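As a rough illustration of the layout change the Motivation and Goals above point at (field and type names here are hypothetical; the actual schema is in the linked proposal doc):

    import java.nio.ByteBuffer;
    import java.util.Map;

    // Today: parallel maps keyed by field ID, with bounds type-erased to binary.
    // Reading one stat for one column still drags in whole maps.
    record DataFileStatsToday(
        Map<Integer, Long> valueCounts,
        Map<Integer, Long> nullValueCounts,
        Map<Integer, ByteBuffer> lowerBounds, // original type is lost here
        Map<Integer, ByteBuffer> upperBounds) {}

    // Proposed direction: one typed struct per column, so readers can project
    // a single stat (e.g. nullValueCount) and bounds keep their real type,
    // which also leaves room for new per-field stats later.
    record ColumnStats<T>(
        long valueCount,
        long nullValueCount,
        T lowerBound,
        T upperBound) {}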