On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
> > I've been thinking about how to encode statistics on Arrow arrays and
> > how to keep the set of statistics known by both producers and
> > consumers (i.e. standardized).
> >
> > The statistics array(s) could be a
> >
> >    map<
> >      // the column index or null if the statistics refer to whole table or batch
> >      column: int32,
> >      map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >    >
> >
> > The keys would be defined as part of the standard:
> >
> > // Statistics values are identified by specified int32-valued keys
> > // so that producers and consumers can agree on physical
> > // encoding and semantics. Statistics can be about a column,
> > // a record batch, or both.
> > typedef int32_t ArrowStatKind;
>
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it is not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With an `int32` field, the
> spec would have to mention a mechanism to ensure global uniqueness of
> vendor-specific statistics.

This encoding scheme can cover quantiles as well. Instead of parsing
strings, or even naively matching just prefixes and then breaking as
providers evolve (as already happens in some C Data Interface
consumers), consumers would expect a list of values in the enum
for a key called ARROW_STAT_QUANTILES.

/// ... Represented as a list<struct<quantile: float32|float64, sum:
/// "same as column type or a type with wider precision">>
#define ARROW_STAT_CUMULATIVE_QUANTILES ...
/// ...
#define ARROW_STAT_QUANTILES ...
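
A minimal sketch of how a consumer might dispatch on integer keys rather
than parse strings. The EXAMPLE_* names and their values are hypothetical,
since the actual key values are left unspecified above, and the payload
descriptions are only illustrative strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t ExampleStatKind;

/* Hypothetical values; the thread leaves the real ones unspecified. */
#define EXAMPLE_STAT_CUMULATIVE_QUANTILES ((ExampleStatKind)1)
#define EXAMPLE_STAT_QUANTILES ((ExampleStatKind)2)

/* Describes the payload a consumer expects for a known kind.
 * Returns NULL for kinds it does not know, so callers can skip them. */
static const char *example_expected_payload(ExampleStatKind kind) {
  switch (kind) {
    case EXAMPLE_STAT_CUMULATIVE_QUANTILES:
      return "list<struct<quantile, sum>>";
    case EXAMPLE_STAT_QUANTILES:
      return "list of values";
    default:
      return NULL;
  }
}

/* Counts the entries this consumer can interpret; unknown kinds are ignored. */
static size_t example_count_understood(const ExampleStatKind *kinds, size_t n) {
  assert(kinds != NULL || n == 0);
  size_t understood = 0;
  for (size_t i = 0; i < n; i++) {
    if (example_expected_payload(kinds[i]) != NULL) {
      understood++;
    }
  }
  return understood;
}
```

Unknown kinds map to NULL, so a consumer skips statistics it does not
understand instead of failing on them.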

--
Felipe

> > Version markers in two-sided protocols never work well long term:
> > see Parquet files lying about the version of the encoder so the files
> > can be read and web browsers lying on their User-Agent strings so
> > websites don't break. It's better to allow probing for individual
> > feature support (in this case, the presence of a specific stat kind in
> > the array).
>
> +1 on this.
>
> Regards
>
> Antoine.
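
Probing for a specific stat kind, as opposed to checking a version
marker, could be sketched as a scan over the int32 keys of one statistics
entry. The flat key array and the function name are assumptions made for
illustration; a real consumer would read the keys from the map's key
array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if `kind` appears among the stat keys of one entry. */
static bool example_has_stat(const int32_t *keys, size_t n, int32_t kind) {
  assert(keys != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (keys[i] == kind) {
      return true;
    }
  }
  return false;
}
```

A consumer would call this once per stat kind it cares about and fall
back to computing the value itself when the kind is absent.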
