On Jun 7, 2024 at 18:30, Felipe Oliveira Carvalho wrote:
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou <anto...@python.org> wrote:


On Jun 7, 2024 at 04:27, Felipe Oliveira Carvalho wrote:
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

    map<
      // the column index, or null if the statistics refer to the whole table or batch
      column: int32,
      map<int32, dense_union<...needed types based on stat kinds in the keys...>>
    >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;

One thing that a plain integer makes more difficult is representing
non-standard statistics. For example some engine might want to expose
elaborate quantile-based statistics even if they are not officially defined
here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
easy with some prefixing to ensure uniqueness. With an `int32` field, the
spec would have to mention a mechanism to ensure global uniqueness of
vendor-specific statistics.
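
To make the prefixing idea concrete, here is a minimal sketch in C. It assumes a convention where standard keys carry a reserved "ARROW:" prefix and everything else is vendor-defined. That prefix, the function name and the example keys are hypothetical and only serve as illustration; nothing in this thread fixes them.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical convention: keys defined by the standard start with
 * "ARROW:"; any other prefix (e.g. "myengine:") is vendor-specific. */
bool stat_key_is_standard(const char *key) {
  static const char prefix[] = "ARROW:";
  /* sizeof includes the trailing NUL, so compare one byte fewer. */
  return strncmp(key, prefix, sizeof(prefix) - 1) == 0;
}
```

With string keys a vendor only needs to pick a prefix nobody else uses. With `int32` keys the spec would instead have to hand out or reserve numeric ranges.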

This encoding scheme can cover quantiles as well. Instead of parsing
strings or even naively matching just prefixes and breaking as
providers evolve (as already happens on some C Data interface
consumers), the consumers would expect a list of values in the enum
for a key called ARROW_STAT_QUANTILES.
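
As a rough sketch of what a consumer could look like under this scheme: the key is compared as an integer and the quantiles arrive as a list of values. Only the name ARROW_STAT_QUANTILES comes from the paragraph above; the numeric values, the `StatEntry` struct and `find_stat` are assumptions made up for this example, and the list is simplified to doubles instead of a dense union.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t ArrowStatKind;

/* Hypothetical key values, chosen arbitrarily for this sketch. */
enum {
  ARROW_STAT_NULL_COUNT = 0,
  ARROW_STAT_QUANTILES = 1
};

/* Hypothetical decoded statistic: an integer key plus a list of values. */
struct StatEntry {
  ArrowStatKind kind;
  const double *values;
  size_t n_values;
};

/* Returns the first entry with the requested key, or NULL if absent.
 * Matching is an exact integer comparison, with no string parsing. */
const struct StatEntry *find_stat(const struct StatEntry *entries, size_t n,
                                  ArrowStatKind kind) {
  for (size_t i = 0; i < n; i++) {
    if (entries[i].kind == kind) {
      return &entries[i];
    }
  }
  return NULL;
}
```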

Ok, there's a misunderstanding. I did not claim that quantiles were difficult to represent. I just used quantiles as an example of a statistic that's not in the current proposed spec, but that some engines would like to expose. In other words, a plain integer makes extensibility more difficult than a string.

Regards

Antoine.
