Hi,

> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it is not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With an `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.

How about reserving a specific range (e.g. 10000-20000) for
vendor-specific statistics? Statistics in that range aren't globally
unique, but global uniqueness may not be needed for a specific
producer-consumer pair.

Thanks,
--
kou

In <f29411ef-793d-4eff-8ff2-248cc1a40...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 7 Jun 2024 10:05:48 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

>
> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
>> I've been thinking about how to encode statistics on Arrow arrays and
>> how to keep the set of statistics known by both producers and
>> consumers (i.e. standardized).
>> The statistics array(s) could be a
>>   map<
>>     // the column index or null if the statistics refer to whole table or batch
>>     column: int32,
>>     map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>   >
>> The keys would be defined as part of the standard:
>> // Statistics values are identified by specified int32-valued keys
>> // so that producers and consumers can agree on physical
>> // encoding and semantics. Statistics can be about a column,
>> // a record batch, or both.
>> typedef int32_t ArrowStatKind;
>
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it is not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With an `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.
>
>> Version markers in two-sided protocols never work well long term:
>> see Parquet files lying about the version of the encoder so the files
>> can be read and web browsers lying on their User-Agent strings so
>> websites don't break. It's better to allow probing for individual
>> feature support (in this case, the presence of a specific stat kind in
>> the array).
>
> +1 on this.
>
> Regards
>
> Antoine.
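
For illustration only, here is a minimal C sketch of the reserved-range idea from the reply above. The bounds are taken from the "10000-20000" example, and treating the upper bound as inclusive is an assumption. The names `ARROW_STAT_KIND_VENDOR_MIN`, `ARROW_STAT_KIND_VENDOR_MAX` and `arrow_stat_kind_is_vendor` are hypothetical and are not part of any Arrow specification.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

/* Statistic kinds are plain int32 keys, as in the proposal under discussion. */
typedef int32_t ArrowStatKind;

/* Hypothetical bounds of the range reserved for vendor-specific statistics.
 * The values come from the "10000-20000" example; the inclusive upper
 * bound is an assumption made for this sketch. */
enum {
  ARROW_STAT_KIND_VENDOR_MIN = 10000,
  ARROW_STAT_KIND_VENDOR_MAX = 20000
};

static_assert(ARROW_STAT_KIND_VENDOR_MIN <= ARROW_STAT_KIND_VENDOR_MAX,
              "vendor range must not be empty");

/* Returns true if `kind` lies in the reserved vendor-specific range.
 * Keys in this range are not globally unique: their meaning has to be
 * agreed on between one particular producer and one particular consumer. */
static inline bool arrow_stat_kind_is_vendor(ArrowStatKind kind) {
  return kind >= ARROW_STAT_KIND_VENDOR_MIN &&
         kind <= ARROW_STAT_KIND_VENDOR_MAX;
}
```

With something like this, a consumer could probe each key in the statistics map and simply skip vendor-range kinds it has no agreement about, instead of relying on a version marker.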