Re: [DISCUSS] Statistics through the C data interface

Antoine Pitrou Sat, 08 Jun 2024 01:03:38 -0700



On 07/06/2024 at 18:30, Felipe Oliveira Carvalho wrote:

On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote:



On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:

I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

    map<
      // the column index or null if the statistics refer to whole table or batch
      column: int32,
      map<int32, dense_union<...needed types based on stat kinds in the keys...>>
    >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;


One thing that a plain integer makes more difficult is representing
non-standard statistics. For example some engine might want to expose
elaborate quantile-based statistics even if they are not officially defined
here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
easy with some prefixing to ensure uniqueness. With an `int32` field, the
spec would have to mention a mechanism to ensure global uniqueness of
vendor-specific statistics.
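
To make the prefixing idea concrete, here is a minimal sketch in C. It
assumes a purely hypothetical convention that is not part of any Arrow
spec or of this proposal: a key has the form "<namespace>:<name>", and
the namespace "ARROW" is reserved for standardized statistics. The
function names and the ':' separator are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION (hypothetical, not from any Arrow spec): a statistic key is
 * "<namespace>:<name>", and "ARROW" is the namespace reserved for
 * standardized statistics. */
#define STAT_STANDARD_NAMESPACE "ARROW"

/* Returns the length of the namespace part of `key`, or 0 if the key
 * has no non-empty "<namespace>:" prefix. */
size_t stat_key_namespace_len(const char *key) {
  assert(key != NULL);
  const char *colon = strchr(key, ':');
  return colon == NULL ? 0 : (size_t)(colon - key);
}

/* True if `key` lives in namespace `ns`. The whole prefix must match,
 * so "my_engine2:x" is not in namespace "my_engine". */
bool stat_key_in_namespace(const char *key, const char *ns) {
  assert(ns != NULL);
  size_t len = stat_key_namespace_len(key);
  return len > 0 && len == strlen(ns) && strncmp(key, ns, len) == 0;
}

/* True if `key` is a standardized statistic under the assumed convention. */
bool stat_key_is_standard(const char *key) {
  return stat_key_in_namespace(key, STAT_STANDARD_NAMESPACE);
}
```

Under this assumed scheme, an engine could emit a key such as
"my_engine:quantiles" without coordinating with anyone, and a consumer
that only understands standardized keys could skip it by checking
stat_key_is_standard(). An `int32` key would need some comparable
reserved range or registry to give the same guarantee.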


This encoding scheme can cover quantiles as well. Instead of parsing
strings or even naively matching just prefixes and breaking as
providers evolve (as already happens on some C Data interface
consumers), the consumers would expect a list of values in the enum
for a key called ARROW_STAT_QUANTILES.

Ok, there's a misunderstanding. I did not claim that quantiles were
difficult to represent. I just used quantiles as an example of a
statistic that's not in the current proposed spec, but that some engines
would like to expose. In other words, a plain integer makes
extensibility more difficult than a string.


Regards

Antoine.
