> I still think that the current `10000 + 200 * data_field_id` ID
generation logic is very limiting. I have already seen tables with more
than 10k columns in the AI space. And I don't think we would like to create
hard limits here.

If we allow only positive integers for IDs, then the max ID is 2^31-1 =
2,147,483,647. The last 200 field IDs are reserved, and this proposal skips
the first 10,000 IDs for other structures in the manifest file schema. That
still accommodates more than 10.7 million field IDs from the table space
(= (2,147,483,647 - 10,200) / 200). I think that is a reasonable upper
bound for the number of table columns.
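
A quick sanity check of that arithmetic in Java (just the math, nothing
from the proposal itself):

    long maxId = Integer.MAX_VALUE;   // 2,147,483,647
    long reservedLast = 200;          // last 200 field IDs are reserved
    long skippedFirst = 10_000;       // skipped for other manifest structures
    long idsPerField = 200;           // size of each field's stats ID block
    long maxFields = (maxId - reservedLast - skippedFirst) / idsPerField;
    // maxFields == 10_737_367 -> more than 10.7 million table columns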

> I think this would probably be a good opportunity to also reserve space for
metrics which apply to the Sort Order transformation used to create a file.

I've also been thinking about being able to store stats for other
expressions in this space, including partition values that are part of
Amogh's one-file commit and adaptive metadata proposal. Another use case is
keeping stats for derived values, like `to_lower(string_col)`. What I'd
propose is being able to track expressions that are assigned an ID from the
table column space, then storing the expression's stats according to the
structure proposed here. But I wouldn't over-complicate the current
proposal by adding this just yet. We can talk about extending stats to
expressions in another proposal once we have the structure and ID
assignment done.
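
For illustration only (the helper below is hypothetical; nothing in the
proposal defines it), an expression would simply be assigned an ID from the
table column space and reuse the same stats block layout as a data field:

    // hypothetical sketch, not part of the proposal
    int exprId = assignTableColumnId("to_lower(string_col)");
    int statsBase = 10_000 + 200 * exprId;  // same block layout as data fields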

On Tue, Jul 22, 2025 at 9:59 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I'm also sorry I missed the discussion because I was busy trying to keep a
> nearly-3-year-old occupied on a plane :)
>
> I think the proposal is pretty strong, although I have one request.
> Currently, we have the ability to note the sort order which
> was applied to a data file, but we have no way of knowing the statistics
> of that applied transformation for said data file.
> This has stopped us from doing a number of optimizations (like combining
> files which are adjacent based on any complex
> sort ordering). I think this would probably be a good opportunity to also
> reserve space for metrics which apply to the Sort Order
> transformation used to create a file. This would still only be using
> things that are already codified in the spec but
> would make it possible for engines to use those transforms for further
> predicate pushdown or optimization
> in file compaction.
>
> A quick example
>
> Say I am using a hierarchical sort order (A, B, C).
> I could then store the min and max of this transform, which would be
> independent of the individual column bounds. Say:
>
> A: Min 1, Max 2
> B: Min 1, Max 100000
> C: Min 1, Max 100000
>
> (A, B, C) Min: (1, 7000, 32)
> (A, B, C) Max: (2, 1, 100000)
>
> In this case, if I'm looking for the record (2, 2, 4), I can instantly
> reject the file using the sort order bounds, since (2, 2, 4) sorts after
> the max (2, 1, 100000), whereas using the individual column bounds alone I
> would have to read the file.
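>
> A rough sketch of that check (my own hypothetical helper, just to make
> the lexicographic comparison concrete):
>
>   int[] min = {1, 7000, 32};
>   int[] max = {2, 1, 100000};
>   int[] key = {2, 2, 4};
>   // key sorts after max (2 == 2, then 2 > 1), so the file can be skipped
>   boolean skip = compareLex(key, min) < 0 || compareLex(key, max) > 0;
>
>   static int compareLex(int[] a, int[] b) {
>     for (int i = 0; i < a.length; i++) {
>       int cmp = Integer.compare(a[i], b[i]);
>       if (cmp != 0) return cmp;
>     }
>     return 0;
>   }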
>
> This is of course also useful if the sort order uses some kind of
> space-filling curve or other clustering algorithm.
>
> Thanks for your hard work,
> Russ
>
> On Thu, Jul 17, 2025 at 5:07 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Team,
>> Sorry, I was not able to join the discussion on Tuesday :(, but I
>> listened to the recording.
>>
>> A few thoughts:
>> - I still think that the current `10000 + 200 * data_field_id` ID
>> generation logic is very limiting. I have already seen tables with more
>> than 10k columns in the AI space. And I don't think we would like to create
>> hard limits here.
>> - If we instead store a few things in metadata (starting_id,
>> stats_types_present, columns_with_stats), then we can greatly reduce the
>> used ID space and also allow for extensibility, as engines which
>> don't know a specific stats_type could just ignore it.
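>>
>> For example (purely illustrative values; only the three field names above
>> are part of my suggestion):
>>
>>   starting_id: 10000
>>   stats_types_present: [lower_bound, upper_bound, null_count]
>>   columns_with_stats: [1, 2, 5]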
>>
>> Thanks Eduard and Dan for driving this!
>> Peter
>>
>> Eduard Tudenhöfner <etudenhoef...@apache.org> wrote (on Wed, Jul 16,
>> 2025 at 7:53):
>>
>>> Hey everyone,
>>>
>>> We met yesterday and talked about the column stats proposal.
>>> Please find the recording here
>>> <https://drive.google.com/file/d/1WVpSg9XxipO5NzogDc7D4DMsnj9cCMlF/view?usp=sharing>
>>> and the notes here
>>> <https://docs.google.com/document/d/1s9_o_Y8js4kHVCYI2OeL0Yh3Ey0qfkuE7Mm-z-QoEfo/edit?usp=sharing>
>>> .
>>>
>>> Thanks everyone,
>>> Eduard
>>>
>>> On Tue, Jul 8, 2025 at 6:51 PM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> I've just added an event to the dev calendar for July 15 at 9am (PT) to
>>>> discuss the column stats proposal.
>>>>
>>>>
>>>> Eduard
>>>>
>>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>>>>
>>>>> +1 for the wonderful feature. Please count me in if you need any help.
>>>>>
>>>>> Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025 at 21:22:
>>>>> >
>>>>> > +1 Seems like a great improvement! Let me know if I can help out with
>>>>> implementation, measurements, etc.!
>>>>> >
>>>>> > Regards,
>>>>> > Gabor Kaszab
>>>>> >
>>>>> > John Zhuge <jzh...@apache.org> wrote (on Thu, Jun 5, 2025 at 23:41):
>>>>> >>
>>>>> >> +1 Looking forward to this feature
>>>>> >>
>>>>> >> John Zhuge
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> > I think it does not make sense to stick to Avro for manifest files
>>>>> if we break column stats into subfields.
>>>>> >>>
>>>>> >>> This isn't necessarily true. Avro can benefit from better pushdown
>>>>> with Eduard's approach as well by being able to skip more efficiently. With
>>>>> the current layout, Avro stores a list of key/value pairs that are all
>>>>> projected and put into a map. We avoid decoding the values, but each field
>>>>> ID is decoded, then the length of the value is decoded, and finally there
>>>>> is a put operation with an ID and value ByteBuffer pair. With the new
>>>>> approach, we will be able to know which fields are relevant and skip
>>>>> unprojected fields based on the file schema, which we couldn't do before.
>>>>> >>>
>>>>> >>> To skip stats for an unused field (not part of the filter), there
>>>>> are two cases. Lower/upper bounds for types that are fixed width are
>>>>> skipped by updating the read position. And bounds for types that are
>>>>> variable length (strings and binary) are skipped by reading the length and
>>>>> skipping that number of bytes.
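>>>>> >>>
>>>>> >>> Roughly, the two skip paths look like this (an illustrative sketch
>>>>> with made-up helper names, not the actual reader code):
>>>>> >>>
>>>>> >>>   if (isFixedWidth(type)) {
>>>>> >>>     buf.position(buf.position() + fixedWidth(type)); // bump the position
>>>>> >>>   } else {
>>>>> >>>     int len = readLength(buf);          // strings/binary: read the length
>>>>> >>>     buf.position(buf.position() + len); // then skip that many bytes
>>>>> >>>   }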
>>>>> >>>
>>>>> >>> It turns out that actually producing the metric maps is a fairly
>>>>> expensive operation, so being able to skip metrics more quickly even if the
>>>>> bytes still have to be read is going to save time. That said, using a
>>>>> columnar format is still going to be a good idea!
>>>>> >>>
>>>>> >>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> > Together with the change which allows storing metadata in
>>>>> columnar formats
>>>>> >>>>
>>>>> >>>> +1 on this. I think it does not make sense to stick to Avro for
>>>>> manifest files if we break column stats into subfields.
>>>>> >>>>
>>>>> >>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> I would love to see more flexibility in file stats. Together with
>>>>> the change which allows storing metadata in columnar formats, this will
>>>>> open up many new possibilities: Bloom filters in metadata which could be
>>>>> used for filtering out files, HLL sketches, etc.
>>>>> >>>>>
>>>>> >>>>> +1 for the change
>>>>> >>>>>
>>>>> >>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>> +1, excited for this one too. We've seen the current metrics
>>>>> maps blow up memory and hope we can improve that.
>>>>> >>>>>>
>>>>> >>>>>> On the Geo front, this could allow us to add supplementary
>>>>> metrics that don't conform to the geo type, like S2 cell IDs.
>>>>> >>>>>>
>>>>> >>>>>> Thanks
>>>>> >>>>>> Szehon
>>>>> >>>>>>
>>>>> >>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>>>>> etudenhoef...@apache.org> wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> Hey everyone,
>>>>> >>>>>>>
>>>>> >>>>>>> I'm starting a thread to connect folks interested in improving
>>>>> the existing way of collecting column-level statistics (often referred to
>>>>> as metrics in the code). I've already started a proposal, which can be
>>>>> found at https://s.apache.org/iceberg-column-stats.
>>>>> >>>>>>>
>>>>> >>>>>>> Motivation
>>>>> >>>>>>>
>>>>> >>>>>>> Column statistics are currently stored as a mapping of field
>>>>> ID to values across multiple columns (lower/upper bounds, value/NaN/null
>>>>> counts, sizes). This storage model has critical limitations as the number
>>>>> of columns increases and as new types are added to Iceberg:
>>>>> >>>>>>>
>>>>> >>>>>>> - Inefficient storage due to the map-based structure:
>>>>> >>>>>>>   - Large memory overhead during planning/processing
>>>>> >>>>>>>   - Inability to project specific stats (e.g., only
>>>>> null_value_counts for column X)
>>>>> >>>>>>> - Type erasure: original logical/physical types are lost when
>>>>> stored as binary blobs, causing:
>>>>> >>>>>>>   - Lossy type inference during reads
>>>>> >>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>> >>>>>>> - Rigid schema: stats are tied to the data_file entry record,
>>>>> limiting extensibility for new stats.
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Goals
>>>>> >>>>>>>
>>>>> >>>>>>> Improve the column stats representation to allow for the
>>>>> following:
>>>>> >>>>>>>
>>>>> >>>>>>> - Projectability: enable independent access to specific stats
>>>>> (e.g., lower_bounds without loading upper_bounds).
>>>>> >>>>>>> - Type preservation: store original data types to support
>>>>> accurate reads and schema evolution.
>>>>> >>>>>>> - Flexible/extensible representation: allow per-field stats
>>>>> structures (e.g., complex types like Geo/Variant).
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Thanks
>>>>> >>>>>>> Eduard
>>>>>
>>>>
