+1 for the wonderful feature. Please count me in if you need any help.
Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025 at 21:22:
>
> +1 Seems like a great improvement! Let me know if I can help out with
> implementation, measurements, etc.!
>
> Regards,
> Gabor Kaszab
>
> John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025 at 23:41:
>>
>> +1 Looking forward to this feature
>>
>> John Zhuge
>>
>>
>> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>> > I think it does not make sense to tie manifest files to Avro if we
>>> > break column stats into sub-fields.
>>>
>>> This isn't necessarily true. Avro can benefit from better pushdown with
>>> Eduard's approach as well, by being able to skip more efficiently. With
>>> the current layout, Avro stores a list of key/value pairs that are all
>>> projected and put into a map. We avoid decoding the values, but each
>>> field ID is decoded, then the length of the value is decoded, and
>>> finally there is a put operation with an ID and value ByteBuffer pair.
>>> With the new approach, we will be able to know which fields are relevant
>>> and skip unprojected fields based on the file schema, which we couldn't
>>> do before.
>>>
>>> To skip stats for an unused field (one that is not part of the filter),
>>> there are two cases. Lower/upper bounds for fixed-width types are
>>> skipped by updating the read position, and bounds for variable-length
>>> types (strings and binary) are skipped by reading the length and
>>> skipping that number of bytes.
>>>
>>> It turns out that actually producing the metric maps is a fairly
>>> expensive operation, so being able to skip metrics more quickly, even if
>>> the bytes still have to be read, is going to save time. That said, using
>>> a columnar format is still going to be a good idea!
>>>
>>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>
>>>> > Together with the change which allows storing metadata in columnar
>>>> > formats
>>>>
>>>> +1 on this.
>>>> I think it does not make sense to tie manifest files to Avro
>>>> if we break column stats into sub-fields.
>>>>
>>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>>
>>>>> I would love to see more flexibility in file stats. Together with the
>>>>> change that allows storing metadata in columnar formats, this will
>>>>> open up many new possibilities: Bloom filters in metadata that could
>>>>> be used for filtering out files, HLL sketches, etc.
>>>>>
>>>>> +1 for the change
>>>>>
>>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>> +1, excited for this one too. We've seen the current metrics maps
>>>>>> blow up memory, and we hope this can improve that.
>>>>>>
>>>>>> On the Geo front, this could allow us to add supplementary metrics
>>>>>> that don't conform to the geo type, like S2 cell IDs.
>>>>>>
>>>>>> Thanks,
>>>>>> Szehon
>>>>>>
>>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner
>>>>>> <etudenhoef...@apache.org> wrote:
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I'm starting a thread to connect folks interested in improving the
>>>>>>> existing way of collecting column-level statistics (often referred
>>>>>>> to as metrics in the code). I've already started a proposal, which
>>>>>>> can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> Column statistics are currently stored as a mapping of field ID to
>>>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>>>> counts, sizes).
>>>>>>> This storage model has critical limitations as the number of
>>>>>>> columns increases and as new types are added to Iceberg:
>>>>>>>
>>>>>>> Inefficient storage due to the map-based structure:
>>>>>>>
>>>>>>> - Large memory overhead during planning/processing
>>>>>>>
>>>>>>> - Inability to project specific stats (e.g., only null_value_counts
>>>>>>>   for column X)
>>>>>>>
>>>>>>> Type erasure: original logical/physical types are lost when stats
>>>>>>> are stored as binary blobs, causing:
>>>>>>>
>>>>>>> - Lossy type inference during reads
>>>>>>>
>>>>>>> - Schema evolution challenges (e.g., widening types)
>>>>>>>
>>>>>>> Rigid schema: stats are tied to the data_file entry record, limiting
>>>>>>> extensibility for new stats.
>>>>>>>
>>>>>>>
>>>>>>> Goals
>>>>>>>
>>>>>>> Improve the column stats representation to allow for the following:
>>>>>>>
>>>>>>> - Projectability: enable independent access to specific stats
>>>>>>>   (e.g., lower_bounds without loading upper_bounds).
>>>>>>>
>>>>>>> - Type preservation: store original data types to support accurate
>>>>>>>   reads and schema evolution.
>>>>>>>
>>>>>>> - Flexible/extensible representation: allow per-field stats
>>>>>>>   structures (e.g., complex types like Geo/Variant).
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eduard
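
To make the skip behavior Ryan describes concrete, here is a toy sketch of the two cases: fixed-width bounds are skipped by advancing the read position, while variable-length bounds are skipped by reading a length prefix and jumping over that many bytes. This uses a made-up byte layout for illustration only, not Iceberg's actual Avro encoding; the `read_projected` function, field-ID table, and format are all hypothetical.

```python
import struct

# Toy layout (illustrative, NOT Iceberg's real manifest encoding):
# each stats field is a 4-byte big-endian field ID followed by either
# a fixed 8-byte long value (e.g. integer bounds) or a 4-byte length
# plus that many bytes (e.g. string/binary bounds).
FIXED_WIDTH = {1: 8, 2: 8}  # field ID -> fixed byte width

def read_projected(buf, projected_ids):
    """Decode only the projected fields; skip the rest by position math."""
    pos, out = 0, {}
    while pos < len(buf):
        (field_id,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if field_id in FIXED_WIDTH:
            if field_id in projected_ids:
                (value,) = struct.unpack_from(">q", buf, pos)
                out[field_id] = value
            # fixed-width skip: just advance the read position
            pos += FIXED_WIDTH[field_id]
        else:
            (length,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            if field_id in projected_ids:
                out[field_id] = buf[pos:pos + length]
            # variable-length skip: read the length, jump over the bytes
            pos += length
    return out

# Three stats fields: two fixed-width longs, one variable-length blob.
buf = (struct.pack(">iq", 1, 100)             # field 1: lower bound = 100
       + struct.pack(">iq", 2, 200)           # field 2: upper bound = 200
       + struct.pack(">ii", 3, 3) + b"abc")   # field 3: string bound "abc"

print(read_projected(buf, {2}))  # → {2: 200}; fields 1 and 3 never decoded
```

The point of the sketch is the cost model from the thread: unprojected fields cost only a position update (or a length read plus a jump), instead of a decode and a map `put` per field.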