Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei
Sun, 09 Jun 2024 00:01:27 -0700

Hi,

> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it is not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With an `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.


How about reserving a specific range (e.g. 10000-20000) for
vendor-specific statistics? Statistics in that range wouldn't
be globally unique, but global uniqueness may not be needed
for communication between a specific producer and consumer.
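
A minimal sketch of the reserved-range idea, assuming the
10000-20000 bounds from the example above are inclusive. The
names ARROW_STAT_KIND_VENDOR_MIN, ARROW_STAT_KIND_VENDOR_MAX and
arrow_stat_kind_is_vendor() are hypothetical and not part of any
spec:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; nothing here is defined by the Arrow spec. */
typedef int32_t ArrowStatKind;

/* Assumed inclusive bounds, taken from the "e.g. 10000-20000" example. */
#define ARROW_STAT_KIND_VENDOR_MIN 10000
#define ARROW_STAT_KIND_VENDOR_MAX 20000

/* Returns true if `kind` falls in the range reserved for
 * vendor-specific statistics. Keys in this range only have meaning
 * between a producer and a consumer that agreed on them out of band. */
static bool arrow_stat_kind_is_vendor(ArrowStatKind kind) {
  return kind >= ARROW_STAT_KIND_VENDOR_MIN &&
         kind <= ARROW_STAT_KIND_VENDOR_MAX;
}
```

A consumer could use such a check to skip vendor keys it doesn't
recognize while still handling the standardized ones.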


Thanks,
-- 
kou

In <f29411ef-793d-4eff-8ff2-248cc1a40...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 7 Jun 2024 10:05:48 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

> 
> On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:
>> I've been thinking about how to encode statistics on Arrow arrays and
>> how to keep the set of statistics known by both producers and
>> consumers (i.e. standardized).
>> The statistics array(s) could be a
>>    map<
>>      // the column index or null if the statistics refer to whole table
>>      // or batch
>>      column: int32,
>>      map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>    >
>> The keys would be defined as part of the standard:
>> // Statistics values are identified by specified int32-valued keys
>> // so that producers and consumers can agree on physical
>> // encoding and semantics. Statistics can be about a column,
>> // a record batch, or both.
>> typedef int32_t ArrowStatKind;
> 
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it is not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With an `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.
> 
>> Version markers in two-sided protocols never work well long term:
>> see Parquet files lying about the version of the encoder so the files
>> can be read and web browsers lying in their User-Agent strings so
>> websites don't break. It's better to allow probing for individual
>> feature support (in this case, the presence of a specific stat kind in
>> the array).
> 
> +1 on this.
> 
> Regards
> 
> Antoine.
