Re: [DISCUSS] Statistics through the C data interface

Felipe Oliveira Carvalho Sat, 08 Jun 2024 13:17:22 -0700

> I just used quantiles as an example of a statistic that's not in the current 
> proposed spec, but that some engines would like to expose.


All statistics are optional so we can always add more to the spec.

> In other words, a plain integer makes extensibility more difficult than a 
> string.

Only the standardized metrics would be identified by an integer.
For non-standard metrics, ARROW_STAT_ANY can be used together with a
string identifier. This is very similar to how predefined Arrow types
coexist with extension types identified by a string.

Since the C Data Interface is used to connect decoupled systems,
standardizing most metrics would maximize the chances that statistics
are produced and consumed correctly. The opaqueness of the integer keys
forces implementers to read abi.h, which explains the semantics of each
metric.
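
To make the idea concrete, here is a rough sketch of how integer keys and
the ARROW_STAT_ANY escape hatch could fit together. Nothing here is part of
any spec: only the names ArrowStatKind, ARROW_STAT_ANY and
ARROW_STAT_QUANTILES appear in this thread. The numeric values,
ARROW_STAT_NULL_COUNT, struct StatKey and stat_key_equal() are assumptions
made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Statistics are identified by int32-valued keys, as proposed in the thread. */
typedef int32_t ArrowStatKind;

/* Numeric values are invented for this sketch; a real abi.h would define
 * them and document the semantics of each one. */
#define ARROW_STAT_ANY        ((ArrowStatKind)0) /* non-standard, named by string */
#define ARROW_STAT_NULL_COUNT ((ArrowStatKind)1) /* hypothetical standard key */
#define ARROW_STAT_QUANTILES  ((ArrowStatKind)2) /* named in the thread */

/* Hypothetical key: an integer for standardized statistics, plus a string
 * that is only meaningful when kind == ARROW_STAT_ANY. */
struct StatKey {
  ArrowStatKind kind;
  const char* name; /* e.g. a vendor-prefixed identifier, or NULL */
};

/* Returns 1 if both keys identify the same statistic, 0 otherwise.
 * Standard keys compare by integer only; ARROW_STAT_ANY keys also
 * compare their string identifiers. */
static int stat_key_equal(const struct StatKey* a, const struct StatKey* b) {
  if (a->kind != b->kind) return 0;
  if (a->kind != ARROW_STAT_ANY) return 1;
  if (a->name == NULL || b->name == NULL) return 0;
  return strcmp(a->name, b->name) == 0;
}
```

With this shape, a consumer can switch on the integer for the standardized
statistics and only falls back to string comparison for ARROW_STAT_ANY
entries, which mirrors how extension types are matched by name.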

On Sat, Jun 8, 2024 at 5:03 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
>
> On 07/06/2024 at 18:30, Felipe Oliveira Carvalho wrote:
> > On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>
> >> On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:
> >>> I've been thinking about how to encode statistics on Arrow arrays and
> >>> how to keep the set of statistics known by both producers and
> >>> consumers (i.e. standardized).
> >>>
> >>> The statistics array(s) could be a
> >>>
> >>>     map<
> >>>       // the column index or null if the statistics refer to whole table or batch
> >>>       column: int32,
> >>>       map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >>>     >
> >>>
> >>> The keys would be defined as part of the standard:
> >>>
> >>> // Statistics values are identified by specified int32-valued keys
> >>> // so that producers and consumers can agree on physical
> >>> // encoding and semantics. Statistics can be about a column,
> >>> // a record batch, or both.
> >>> typedef int32_t ArrowStatKind;
> >>
> >> One thing that a plain integer makes more difficult is representing
> >> non-standard statistics. For example some engine might want to expose
> >> elaborate quantile-based statistics even if it is not officially defined
> >> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> >> easy with some prefixing to ensure uniqueness. With a `int32` field, the
> >> spec would have to mention a mechanism to ensure global uniqueness of
> >> vendor-specific statistics.
> >
> > This encoding scheme can cover quantiles as well. Instead of parsing
> > strings or even naively matching just prefixes and breaking as
> > providers evolve (as already happens on some C Data interface
> > consumers), the consumers would expect a list of values in the enum
> > for a key called ARROW_STAT_QUANTILES.
>
> Ok, there's a misunderstanding. I did not claim that quantiles were
> difficult to represent. I just used quantiles as an example of a
> statistic that's not in the current proposed spec, but that some engines
> would like to expose. In other words, a plain integer makes
> extensibility more difficult than a string.
>
> Regards
>
> Antoine.
