+1 Seems a great improvement! Let me know if I can help out with implementation, measurements, etc.!
Regards,
Gabor Kaszab

John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025 at 23:41:

> +1 Looking forward to this feature
>
> John Zhuge
>
>
> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> > I think it does not make sense to stick manifest files to Avro if we
>> > break column stats into sub fields.
>>
>> This isn't necessarily true. Avro can benefit from better pushdown with
>> Eduard's approach as well by being able to skip more efficiently. With the
>> current layout, Avro stores a list of key/value pairs that are all
>> projected and put into a map. We avoid decoding the values, but each field
>> ID is decoded, then the length of the value is decoded, and finally there
>> is a put operation with an ID and value ByteBuffer pair. With the new
>> approach, we will be able to know which fields are relevant and skip
>> unprojected fields based on the file schema, which we couldn't do before.
>>
>> To skip stats for an unused field (not part of the filter), there are two
>> cases. Lower/upper bounds for types that are fixed width are skipped by
>> updating the read position. And bounds for types that are variable length
>> (strings and binary) are skipped by reading the length and skipping that
>> number of bytes.
>>
>> It turns out that actually producing the metric maps is a fairly
>> expensive operation, so being able to skip metrics more quickly even if the
>> bytes still have to be read is going to save time. That said, using a
>> columnar format is still going to be a good idea!
>>
>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > Together with the change which allows storing metadata in columnar
>>> > formats
>>>
>>> +1 on this. I think it does not make sense to stick manifest files to
>>> Avro if we break column stats into sub fields.
>>>
>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> I would love to see more flexibility in file stats.
>>>> Together with the change which allows storing metadata in columnar
>>>> formats, this will open up many new possibilities: Bloom filters in
>>>> metadata which could be used for filtering out files, HLL sketches, etc.
>>>>
>>>> +1 for the change
>>>>
>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>
>>>>> +1, excited for this one too. We've seen the current metrics maps
>>>>> blow up memory and hope this can improve that.
>>>>>
>>>>> On the Geo front, this could allow us to add supplementary metrics
>>>>> that don't conform to the geo type, like S2 Cell Ids.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>>>>> etudenhoef...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I'm starting a thread to connect folks interested in improving the
>>>>>> existing way of collecting column-level statistics (often referred to as
>>>>>> *metrics* in the code). I've already started a proposal, which can
>>>>>> be found at https://s.apache.org/iceberg-column-stats.
>>>>>>
>>>>>> *Motivation*
>>>>>>
>>>>>> Column statistics are currently stored as a mapping of field id to
>>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>>> counts, sizes).
>>>>>> This storage model has critical limitations as the
>>>>>> number of columns increases and as new types are being added to Iceberg:
>>>>>>
>>>>>> - Inefficient Storage due to map-based structure:
>>>>>>   - Large memory overhead during planning/processing
>>>>>>   - Inability to project specific stats (e.g., only
>>>>>>     null_value_counts for column X)
>>>>>> - Type Erasure: Original logical/physical types are lost when
>>>>>>   stored as binary blobs, causing:
>>>>>>   - Lossy type inference during reads
>>>>>>   - Schema evolution challenges (e.g., widening types)
>>>>>> - Rigid Schema: Stats are tied to the data_file entry record,
>>>>>>   limiting extensibility for new stats.
>>>>>>
>>>>>>
>>>>>> *Goals*
>>>>>>
>>>>>> Improve the column stats representation to allow for the following:
>>>>>>
>>>>>> - Projectability: Enable independent access to specific stats
>>>>>>   (e.g., lower_bounds without loading upper_bounds).
>>>>>> - Type Preservation: Store original data types to support accurate
>>>>>>   reads and schema evolution.
>>>>>> - Flexible/Extensible Representation: Allow per-field stats
>>>>>>   structures (e.g., complex types like Geo/Variant).
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Eduard
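
For anyone following the thread, Ryan's two skip cases can be illustrated with a rough Python sketch. This is not Iceberg's or Avro's actual reader code: the function names, the three-field SCHEMA, and the simplified framing (a single count prefix instead of Avro's block encoding) are assumptions made up for this example. Only the zigzag varint encoding of lengths and IDs follows what Avro does for int/long.

```python
import struct

def write_varint(n: int) -> bytes:
    """Zigzag + variable-length encoding, the scheme Avro uses for int/long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one zigzag varint; return (value, new read position)."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos

# Map-style layout (simplified): count, then (field id, length, bytes) entries.
def encode_bounds_map(bounds: dict[int, bytes]) -> bytes:
    out = bytearray(write_varint(len(bounds)))
    for field_id, value in bounds.items():
        out += write_varint(field_id) + write_varint(len(value)) + value
    return bytes(out)

def read_bounds_map(buf: bytes) -> dict[int, bytes]:
    """Every entry costs an ID decode, a length decode, and a map put."""
    count, pos = read_varint(buf, 0)
    result = {}
    for _ in range(count):
        field_id, pos = read_varint(buf, pos)
        length, pos = read_varint(buf, pos)
        result[field_id] = buf[pos:pos + length]
        pos += length
    return result

# Struct-style layout (hypothetical): the schema fixes field order and widths.
# Each entry is (field id, fixed width in bytes, or None for variable length).
SCHEMA = [(1, 8), (2, None), (3, 4)]

def encode_bounds_struct(schema, bounds: dict[int, bytes]) -> bytes:
    out = bytearray()
    for field_id, width in schema:
        value = bounds[field_id]
        if width is None:
            out += write_varint(len(value))  # length prefix only when needed
        out += value
    return bytes(out)

def read_projected(schema, buf: bytes, wanted: set[int]) -> dict[int, bytes]:
    """Materialize only the wanted fields; skip the rest by moving pos."""
    pos = 0
    result = {}
    for field_id, width in schema:
        if width is None:
            # Variable length: read the length, then skip that many bytes.
            width, pos = read_varint(buf, pos)
        if field_id in wanted:
            result[field_id] = buf[pos:pos + width]
        pos += width  # Fixed width: just advance the read position.
    return result

# Example bounds: a long, a string, and an int (little-endian for illustration).
bounds = {1: struct.pack("<q", 42), 2: b"apple", 3: struct.pack("<i", 7)}
map_bytes = encode_bounds_map(bounds)
struct_bytes = encode_bounds_struct(SCHEMA, bounds)
only_field_3 = read_projected(SCHEMA, struct_bytes, {3})
```

In the map-style reader every field pays for an ID decode, a length decode, and a dictionary insert, even when the filter only touches field 3. In the struct-style reader the unused fields cost at most one length decode plus a position update, which is the saving described above, even though the bytes themselves are still read.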