deniskuzZ commented on code in PR #8202:
URL: https://github.com/apache/iceberg/pull/8202#discussion_r1941767628
##########
format/puffin-spec.md:
##########
@@ -181,6 +181,23 @@ for Puffin v1.
 [roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
 [roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout
+#### `hive-column-statistics-obj` blob type
+
+A serialized form of Hive ColumnStatsObject.
+
+The ColumnStatsObject supports Histograms, NDV, Min and Max values, Number of nulls, Number of trues, column name, type.
+A full list of supported statistics is listed in the table here:
+[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics)

Review Comment:
Hi @rdblue, thanks for checking this PR!

`The partition statistics files provide a way to aggregate those beyond the file level`

Does Iceberg provide built-in support for getting aggregated column stats? I mean, is there a library or service that generates partition statistics files with aggregated column stats? AFAIK we only do this for basic stats: https://github.com/apache/iceberg/pull/11216

If yes, could you please point me to the code where that is done? I was under the impression that, of the column stats, only NDV is calculated and stored in partition files.

How about:
1. bitvectors - used to improve stats estimations for the IN operator
2. histogram - histogram statistics, which are particularly useful for skewed data and range predicates (KLL data sketches)
3. numTrue/numFalse

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
