On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

   map<
     // the column index or null if the statistics refer to whole table or batch
     column: int32,
     map<int32, dense_union<...needed types based on stat kinds in the keys...>>
   >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;

One thing that a plain integer makes more difficult is representing non-standard statistics. For example, some engine might want to expose elaborate quantile-based statistics even if they are not officially defined here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite easy: some prefixing is enough to ensure uniqueness. With an `int32` field, the spec would have to define a mechanism to ensure global uniqueness of vendor-specific statistics.
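
As a rough illustration of the prefixing idea, here is a minimal sketch in plain Python. The reserved `ARROW:` prefix, the `vendor:name` key layout, and the helper names are all assumptions made up for this example; nothing here is defined by the spec or by any Arrow library.

```python
# Hypothetical prefix reserved for standardized statistics.
# This is an assumption for illustration, not part of any spec.
STANDARD_PREFIX = "ARROW:"


def is_standard_key(key: str) -> bool:
    """Return True if the key uses the (assumed) reserved standard prefix."""
    return key.startswith(STANDARD_PREFIX)


def vendor_key(vendor: str, name: str) -> str:
    """Build a vendor-specific key such as 'acme:quantiles'."""
    if not vendor or ":" in vendor:
        raise ValueError("vendor must be non-empty and must not contain ':'")
    key = f"{vendor}:{name}"
    # Refuse vendor names that would shadow the standard namespace.
    if is_standard_key(key):
        raise ValueError("vendor prefix collides with the standard prefix")
    return key
```

A producer could then emit `vendor_key("acme", "quantiles")` next to standard keys, and a consumer that does not know the `acme` namespace can simply skip it. Doing the same with `int32` keys would need some agreed-upon range allocation or registry.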

Version markers in two-sided protocols never work well long term:
see Parquet files lying about the version of the encoder so the files
can be read and web browsers lying in their User-Agent strings so
websites don't break. It's better to allow probing for individual
feature support (in this case, the presence of a specific stat kind in
the array).

+1 on this.
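
To make the probing idea concrete, here is a small sketch that models the proposed statistics as nested Python dicts (column index or `None` for the whole table or batch, then stat kind, then value). The stat-kind constants, their numeric values, and the `probe` helper are hypothetical placeholders, not values from any standard.

```python
# Hypothetical stat-kind keys; the numeric values are placeholders only.
STAT_ROW_COUNT = 0
STAT_NULL_COUNT = 1
STAT_DISTINCT_COUNT = 2

# Statistics modeled as {column index or None: {stat kind: value}}.
# None stands for "whole table or batch", as in the proposed map.
stats = {
    None: {STAT_ROW_COUNT: 1000},
    0: {STAT_NULL_COUNT: 12},
}


def probe(stats, column, kind, default=None):
    """Look up one statistic; return `default` if the producer did not send it."""
    return stats.get(column, {}).get(kind, default)


# The consumer checks for each stat kind it cares about instead of
# branching on a producer-declared version number.
row_count = probe(stats, None, STAT_ROW_COUNT)
distinct = probe(stats, 0, STAT_DISTINCT_COUNT)
if distinct is None:
    pass  # fall back to a plan that does not need distinct counts
```

A consumer written this way keeps working when a producer adds or omits stat kinds, without either side having to agree on a version.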

Regards

Antoine.
