On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou <[email protected]> wrote:
>
>
> On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:
> > I've been thinking about how to encode statistics on Arrow arrays and
> > how to keep the set of statistics known by both producers and
> > consumers (i.e. standardized).
> >
> > The statistics array(s) could be a
> >
> >   map<
> >     // the column index or null if the statistics refer to whole table or batch
> >     column: int32,
> >     map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >   >
> >
> > The keys would be defined as part of the standard:
> >
> >   // Statistics values are identified by specified int32-valued keys
> >   // so that producers and consumers can agree on physical
> >   // encoding and semantics. Statistics can be about a column,
> >   // a record batch, or both.
> >   typedef ArrowStatKind int32_t;
>
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With a `int32` field, the
> spec would have to mention a mechanism to ensure global uniqueness of
> vendor-specific statistics.

This encoding scheme can cover quantiles as well. Instead of parsing
strings, or naively matching prefixes and breaking as providers evolve
(as already happens in some C Data Interface consumers), consumers would
expect a list of values under a key defined in the enum,
ARROW_STAT_QUANTILES.

  /// ... Represented as a
  /// list<struct<quantile: float32|float64,
  ///             sum: "same as column type or a type with wider precision">>
  #define ARROW_STAT_CUMMULATIVE_QUANTILES ...
  /// ...
  #define ARROW_STAT_QUANTILES ...

--
Felipe

> > Version markers in two-sided protocols never work well long term:
> > see Parquet files lying about the version of the encoder so the files
> > can be read and web browsers lying on their User-Agent strings so
> > websites don't break. It's better to allow probing for individual
> > feature support (in this case, the presence of a specific stat kind in
> > the array).
>
> +1 on this.
>
> Regards
>
> Antoine.
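
To make the "probe for an individual stat kind" idea concrete, here is a minimal C sketch of how a consumer might check the int32 keys of one statistics map entry. Everything named here is an assumption for illustration: the `EXAMPLE_STAT_*` constants and their numeric values, and the helpers `find_stat_kind` and `has_stat_kind`, are hypothetical and not part of any spec or of this thread. It also assumes the keys have already been extracted into a plain C array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Integer-valued statistic kind, as proposed in the thread. */
typedef int32_t ArrowStatKind;

/* Hypothetical key values, for illustration only; the thread assigns none. */
#define EXAMPLE_STAT_NULL_COUNT ((ArrowStatKind)0)
#define EXAMPLE_STAT_QUANTILES  ((ArrowStatKind)1)

/* Return the position of `kind` among the keys of one statistics map
 * entry, or -1 if the producer did not supply that statistic. */
static inline int64_t find_stat_kind(const ArrowStatKind *keys, size_t n_keys,
                                     ArrowStatKind kind) {
    assert(keys != NULL || n_keys == 0);
    for (size_t i = 0; i < n_keys; ++i) {
        if (keys[i] == kind) {
            return (int64_t)i;
        }
    }
    return -1;
}

/* Feature probe: no version marker, just "is this stat kind present?". */
static inline bool has_stat_kind(const ArrowStatKind *keys, size_t n_keys,
                                 ArrowStatKind kind) {
    return find_stat_kind(keys, n_keys, kind) >= 0;
}
```

A consumer that does not recognize a key simply never probes for it, so unknown kinds are skipped without any string parsing or prefix matching.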
