Re: [DISCUSS] v4 - Improved column statistics

Gábor Kaszab Mon, 07 Jul 2025 06:22:05 -0700

+1 Seems like a great improvement! Let me know if I can help out with
implementation, measurements, etc.


Regards,
Gabor Kaszab

On Thu, Jun 5, 2025 at 23:41, John Zhuge <jzh...@apache.org> wrote:

> +1 Looking forward to this feature
>
> John Zhuge
>
>
> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> > I think it does not make sense to stick manifest files to Avro if we
>> break column stats into sub fields.
>>
>> This isn't necessarily true. Avro can benefit from better pushdown with
>> Eduard's approach as well by being able to skip more efficiently. With the
>> current layout, Avro stores a list of key/value pairs that are all
>> projected and put into a map. We avoid decoding the values, but each field
>> ID is decoded, then the length of the value is decoded, and finally there
>> is a put operation with an ID and value ByteBuffer pair. With the new
>> approach, we will be able to know which fields are relevant and skip
>> unprojected fields based on the file schema, which we couldn't do before.
>>
>> To skip stats for an unused field (not part of the filter), there are two
>> cases. Lower/upper bounds for types that are fixed width are skipped by
>> updating the read position. And bounds for types that are variable length
>> (strings and binary) are skipped by reading the length and skipping that
>> number of bytes.
>>
>> It turns out that actually producing the metric maps is a fairly
>> expensive operation, so being able to skip metrics more quickly even if the
>> bytes still have to be read is going to save time. That said, using a
>> columnar format is still going to be a good idea!
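The two skip cases described in the reply above can be sketched roughly as
follows. This is a minimal illustration, not Iceberg's or Avro's reader code:
the record layout (a double bound, a string bound, then another double bound)
and all function names are hypothetical. The only format details assumed are
that Avro writes a double as 8 little-endian bytes and prefixes strings/bytes
with a zigzag varint length.

```python
import io
import struct


def write_long(value: int) -> bytes:
    """Encode a non-negative integer as an Avro-style zigzag varint."""
    raw = (value << 1) ^ (value >> 63)
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def read_long(buf: io.BytesIO) -> int:
    """Decode one zigzag varint from the current read position."""
    shift = 0
    raw = 0
    while True:
        byte = buf.read(1)[0]
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def skip_fixed(buf: io.BytesIO, width: int) -> None:
    """Fixed-width bound: just advance the read position."""
    buf.seek(width, io.SEEK_CUR)


def skip_variable(buf: io.BytesIO) -> None:
    """Variable-length bound: read the length, then skip that many bytes."""
    length = read_long(buf)
    buf.seek(length, io.SEEK_CUR)


def encode_record(price_lower: float, name_lower: str, score_lower: float) -> bytes:
    """Hypothetical per-field layout: double, string, double."""
    name = name_lower.encode("utf-8")
    return (
        struct.pack("<d", price_lower)
        + write_long(len(name))
        + name
        + struct.pack("<d", score_lower)
    )


def read_score_lower(record: bytes) -> float:
    """Read only the third bound; the first two are skipped, not decoded."""
    buf = io.BytesIO(record)
    skip_fixed(buf, 8)      # unprojected double bound
    skip_variable(buf)      # unprojected string bound
    return struct.unpack("<d", buf.read(8))[0]
```

Note that no map is built and no ByteBuffer-like object is allocated for the
skipped fields, which is the saving the reply points at.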
>>
>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > Together with the change which allows storing metadata in columnar
>>> formats
>>>
>>> +1 on this. I think it does not make sense to stick manifest files to
>>> Avro if we break column stats into sub fields.
>>>
>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> I would love to see more flexibility in file stats. Together with the
>>>> change which allows storing metadata in columnar formats, this will open
>>>> up many new possibilities: Bloom filters in metadata which could be used
>>>> for filtering out files, HLL sketches, etc.
>>>>
>>>> +1 for the change
>>>>
>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>
>>>>> +1, excited for this one too. We've seen the current metrics maps
>>>>> blow up memory and hope this can improve that.
>>>>>
>>>>> On the Geo front, this could allow us to add supplementary metrics
>>>>> that don't conform to the geo type, like S2 Cell Ids.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>>>>> etudenhoef...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I'm starting a thread to connect folks interested in improving the
>>>>>> existing way of collecting column-level statistics (often referred to as
>>>>>> *metrics* in the code). I've already started a proposal, which can
>>>>>> be found at https://s.apache.org/iceberg-column-stats.
>>>>>>
>>>>>> *Motivation*
>>>>>>
>>>>>> Column statistics are currently stored as a mapping of field id to
>>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>>> counts, sizes). This storage model has critical limitations as the
>>>>>> number of columns increases and as new types are being added to Iceberg:
>>>>>>
>>>>>>    - Inefficient Storage due to map-based structure:
>>>>>>       - Large memory overhead during planning/processing
>>>>>>       - Inability to project specific stats (e.g., only
>>>>>>         null_value_counts for column X)
>>>>>>    - Type Erasure: Original logical/physical types are lost when
>>>>>>      stored as binary blobs, causing:
>>>>>>       - Lossy type inference during reads
>>>>>>       - Schema evolution challenges (e.g., widening types)
>>>>>>    - Rigid Schema: Stats are tied to the data_file entry record,
>>>>>>      limiting extensibility for new stats.
>>>>>>
>>>>>>
>>>>>> *Goals*
>>>>>>
>>>>>> Improve the column stats representation to allow for the following:
>>>>>>
>>>>>>    - Projectability: Enable independent access to specific stats
>>>>>>      (e.g., lower_bounds without loading upper_bounds).
>>>>>>    - Type Preservation: Store original data types to support accurate
>>>>>>      reads and schema evolution.
>>>>>>    - Flexible/Extensible Representation: Allow per-field stats
>>>>>>      structures (e.g., complex types like Geo/Variant).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Eduard
>>>>>>
>>>>>
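To make the "Type Erasure" point in the original proposal mail concrete, here
is a small sketch under stated assumptions. It is not the layout from the
proposal document: the `FieldStats` class, its field names, and the
`decode_bound` helper are hypothetical. The only format detail assumed is that
int and long bounds are serialized as 4-byte and 8-byte little-endian values.

```python
import struct
from dataclasses import dataclass

# Current-style layout: one map per statistic, keyed by field id.
# Bounds are untyped bytes; the writer's type is not recorded.
map_stats = {
    "lower_bounds": {1: struct.pack("<i", 10)},   # written while column 1 was an int
    "upper_bounds": {1: struct.pack("<i", 99)},
    "null_value_counts": {1: 0},
}


@dataclass
class FieldStats:
    """Hypothetical per-field stats struct that keeps the writer's type."""
    type_name: str
    lower_bound: bytes
    upper_bound: bytes
    null_value_count: int


# struct format codes for the two types used in this sketch
FORMATS = {"int": "<i", "long": "<q"}


def decode_bound(raw: bytes, type_name: str) -> int:
    """Decode a bound, interpreting the bytes as the given type."""
    return struct.unpack(FORMATS[type_name], raw)[0]


# Same statistics, stored per field together with the original type.
typed = FieldStats(
    type_name="int",
    lower_bound=struct.pack("<i", 10),
    upper_bound=struct.pack("<i", 99),
    null_value_count=0,
)

# Suppose column 1 has since been widened from int to long.
# With the typed struct the reader knows how the bytes were written:
lower = decode_bound(typed.lower_bound, typed.type_name)  # 10

# With the map layout the reader only has the current table type ("long"),
# and decoding the 4-byte value as an 8-byte long fails:
try:
    decode_bound(map_stats["lower_bounds"][1], "long")
    widened_read_ok = True
except struct.error:
    widened_read_ok = False
```

A real reader can work around this by inferring the type from the byte length,
which is the kind of lossy inference the proposal aims to avoid.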
