+1 for the wonderful feature. Please count me in if you need any help.
Gábor Kaszab <gaborkas...@apache.org> wrote on Mon, Jul 7, 2025 at 21:22:
>
> +1 Seems like a great improvement! Let me know if I can help out with
> implementation, measurements, etc.!
>
> Regards,
> Gabor Kaszab
>
> John Zhuge <jzh...@apache.org> wrote on Thu, Jun 5, 2025 at 23:41:
>>
>> +1 Looking forward to this feature
>>
>> John Zhuge
>>
>>
>> On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>> > I think it does not make sense to tie manifest files to Avro if we
>>> > break column stats into sub-fields.
>>>
>>> This isn't necessarily true. Avro can benefit from better pushdown with
>>> Eduard's approach as well, by being able to skip more efficiently. With
>>> the current layout, Avro stores a list of key/value pairs that are all
>>> projected and put into a map. We avoid decoding the values, but each
>>> field ID is decoded, then the length of the value is decoded, and
>>> finally there is a put operation with an ID and value ByteBuffer pair.
>>> With the new approach, we will be able to know which fields are relevant
>>> and skip unprojected fields based on the file schema, which we couldn't
>>> do before.
>>>
>>> To skip stats for an unused field (one that is not part of the filter),
>>> there are two cases. Lower/upper bounds for fixed-width types are
>>> skipped by updating the read position, and bounds for variable-length
>>> types (strings and binary) are skipped by reading the length and
>>> skipping that number of bytes.
>>>
>>> It turns out that actually producing the metric maps is a fairly
>>> expensive operation, so being able to skip metrics more quickly, even if
>>> the bytes still have to be read, is going to save time. That said, using
>>> a columnar format is still going to be a good idea!
>>>
>>> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>>>>
>>>> > Together with the change which allows storing metadata in columnar
>>>> > formats
>>>>
>>>> +1 on this.
>>>> I think it does not make sense to tie manifest files to Avro
>>>> if we break column stats into sub-fields.
>>>>
>>>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>>
>>>>> I would love to see more flexibility in file stats. Together with the
>>>>> change that allows storing metadata in columnar formats, this will
>>>>> open up many new possibilities: Bloom filters in metadata that could
>>>>> be used for filtering out files, HLL sketches, etc.
>>>>>
>>>>> +1 for the change
>>>>>
>>>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>> +1, excited for this one too. We've seen the current metrics maps
>>>>>> blow up memory, and we hope this can improve that.
>>>>>>
>>>>>> On the Geo front, this could allow us to add supplementary metrics
>>>>>> that don't conform to the geo type, like S2 cell IDs.
>>>>>>
>>>>>> Thanks,
>>>>>> Szehon
>>>>>>
>>>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner
>>>>>> <etudenhoef...@apache.org> wrote:
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I'm starting a thread to connect folks interested in improving the
>>>>>>> existing way of collecting column-level statistics (often referred
>>>>>>> to as metrics in the code). I've already started a proposal, which
>>>>>>> can be found at https://s.apache.org/iceberg-column-stats.
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> Column statistics are currently stored as a mapping of field ID to
>>>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>>>> counts, sizes).
>>>>>>> This storage model has critical limitations as the number of
>>>>>>> columns increases and as new types are added to Iceberg:
>>>>>>>
>>>>>>> Inefficient storage due to the map-based structure:
>>>>>>>
>>>>>>> - Large memory overhead during planning/processing
>>>>>>>
>>>>>>> - Inability to project specific stats (e.g., only null_value_counts
>>>>>>>   for column X)
>>>>>>>
>>>>>>> Type erasure: original logical/physical types are lost when stats
>>>>>>> are stored as binary blobs, causing:
>>>>>>>
>>>>>>> - Lossy type inference during reads
>>>>>>>
>>>>>>> - Schema evolution challenges (e.g., widening types)
>>>>>>>
>>>>>>> Rigid schema: stats are tied to the data_file entry record, limiting
>>>>>>> extensibility for new stats.
>>>>>>>
>>>>>>>
>>>>>>> Goals
>>>>>>>
>>>>>>> Improve the column stats representation to allow for the following:
>>>>>>>
>>>>>>> - Projectability: enable independent access to specific stats
>>>>>>>   (e.g., lower_bounds without loading upper_bounds).
>>>>>>>
>>>>>>> - Type preservation: store original data types to support accurate
>>>>>>>   reads and schema evolution.
>>>>>>>
>>>>>>> - Flexible/extensible representation: allow per-field stats
>>>>>>>   structures (e.g., complex types like Geo/Variant).
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eduard
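
To make the skip behavior Ryan describes concrete, here is a toy sketch of the two cases: fixed-width bounds are skipped by advancing the read position, while variable-length bounds are skipped by reading a length prefix and jumping over that many bytes. This uses a made-up byte layout for illustration only, not Iceberg's actual Avro encoding; the `read_projected` function, field-ID table, and format are all hypothetical.

```python
import struct

# Toy layout (illustrative, NOT Iceberg's real manifest encoding):
# each stats field is a 4-byte big-endian field ID followed by either
# a fixed 8-byte long value (e.g. integer bounds) or a 4-byte length
# plus that many bytes (e.g. string/binary bounds).
FIXED_WIDTH = {1: 8, 2: 8}  # field ID -> fixed byte width

def read_projected(buf, projected_ids):
    """Decode only the projected fields; skip the rest by position math."""
    pos, out = 0, {}
    while pos < len(buf):
        (field_id,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if field_id in FIXED_WIDTH:
            if field_id in projected_ids:
                (value,) = struct.unpack_from(">q", buf, pos)
                out[field_id] = value
            # fixed-width skip: just advance the read position
            pos += FIXED_WIDTH[field_id]
        else:
            (length,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            if field_id in projected_ids:
                out[field_id] = buf[pos:pos + length]
            # variable-length skip: read the length, jump over the bytes
            pos += length
    return out

# Three stats fields: two fixed-width longs, one variable-length blob.
buf = (struct.pack(">iq", 1, 100)             # field 1: lower bound = 100
       + struct.pack(">iq", 2, 200)           # field 2: upper bound = 200
       + struct.pack(">ii", 3, 3) + b"abc")   # field 3: string bound "abc"

print(read_projected(buf, {2}))  # → {2: 200}; fields 1 and 3 never decoded
```

The point of the sketch is the cost model from the thread: unprojected fields cost only a position update (or a length read plus a jump), instead of a decode and a map `put` per field.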