Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Sat, 08 Jun 2024 23:37:33 -0700

Hi,

We can use 4. for per-batch statistics too, because 4. uses a
separate API call: users can design that separate API call so
that it returns per-batch statistics.
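
For illustration only, here is a minimal sketch of how a consumer
could use per-batch statistics obtained through such a separate
call. Everything in it is an assumption made for the example:
BatchStats, may_contain(), scan() and the get_stats callback are
hypothetical names, not part of any Arrow API, and plain Python
lists stand in for record batches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class BatchStats:
    """Hypothetical per-batch statistics for a single integer column."""
    min_value: Optional[int]
    max_value: Optional[int]


def may_contain(stats: BatchStats, lo: int, hi: int) -> bool:
    """Return False only when the statistics prove no value is in [lo, hi]."""
    if stats.min_value is None or stats.max_value is None:
        return True  # statistics missing: the batch cannot be skipped safely
    return stats.max_value >= lo and stats.min_value <= hi


def scan(
    batches: Iterable[List[int]],
    get_stats: Callable[[int], BatchStats],  # stands in for the separate API call
    lo: int,
    hi: int,
) -> Iterator[int]:
    """Yield the values in [lo, hi], skipping batches ruled out by statistics."""
    for index, batch in enumerate(batches):
        if not may_contain(get_stats(index), lo, hi):
            continue  # fast path: the batch contents are never inspected
        for value in batch:
            if lo <= value <= hi:
                yield value


batches = [[1, 2, 3], [10, 11, 12], [20, 25]]
stats = [BatchStats(1, 3), BatchStats(10, 12), BatchStats(None, None)]
result = list(scan(batches, lambda i: stats[i], 10, 20))
```

Here the first batch is skipped by its statistics alone, and the
last batch is still scanned because it has no statistics.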


Thanks,
-- 
kou

In <CAH=7pqywdtkrfnrxk_dakbgrrmwas6c3d7-u_33mc7dfsu4...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 13:14:08 +0200,
  Alessandro Molina <[email protected]> wrote:

> I brought it up on Github, but writing here too to avoid spawning too many
> threads.
> https://github.com/apache/arrow/issues/38837#issuecomment-2145343755
> 
> It's not something we have to address now, but it would be great if we
> could design a solution that can be extended in the future to add per-batch
> statistics in ArrowArrayStream.
> 
> While it's true that in most cases the producer code will be applying the
> filtering, in the case of C-Data we can't take that for granted. There
> might be cases where the consumer has no control over the filtering that
> the producer would apply and the producer might not be aware of the
> filtering that the consumer might want to do.
> 
> In those cases providing the statistics per-batch would allow the consumer
> to skip the batches it doesn't care about, thus giving the opportunity for
> a fast path.
> 
> On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou <[email protected]> wrote:
> 
>>
>> Hi Kou,
>>
>> Thanks for pushing for this!
>>
>> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
>> > 4. Standardize Apache Arrow schema for statistics and
>> >     transmit statistics via separated API call that uses the
>> >     C data interface
>> [...]
>> >
>> > I think that 4. is the best approach in these candidates.
>>
>> I agree.
>>
>> > If we select 4., we need to standardize Apache Arrow schema
>> > for statistics. How about the following schema?
>> >
>> > ----
>> > Metadata:
>> >
>> > | Name                       | Value | Comments |
>> > |----------------------------|-------|--------- |
>> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>
>> I'm not sure this is useful, but it doesn't hurt.
>>
>> Nit: this should be "ARROW:statistics:version" for consistency with
>> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>>
>> > Fields:
>> >
>> > | Name           | Type                  | Comments |
>> > |----------------|-----------------------| -------- |
>> > | column         | utf8                  | (2)      |
>> > | key            | utf8 not null         | (3)      |
>>
>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>> the representation more efficient where there are many columns?
>>
>> 2. Should the statistics perhaps be nested as a map type under each
>> column to avoid repeating `column`, or is that overkill?
>>
>> 3. Should there also be room for multi-column statistics (such as
>> cardinality of a given column pair), or is it too complex for now?
>>
>> Regards
>>
>> Antoine.
>>
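
To make Antoine's questions 1. and 2. above a bit more concrete,
here is a minimal sketch in plain Python of what the two
alternatives do to the data. It is not Arrow code: the helper
names dictionary_encode() and nest_by_column() are hypothetical,
and the key names "min" and "max" are placeholders chosen only
for this example.

```python
from typing import Dict, List, Optional, Tuple

# Flat layout from the proposal: one (column, key) row per statistic.
# `column` is nullable in the proposed schema, hence Optional[str].
FlatRow = Tuple[Optional[str], str]


def dictionary_encode(keys: List[str]) -> Tuple[List[int], List[str]]:
    """Question 1: store each distinct key once and refer to it by index."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for key in keys:
        if key not in positions:
            positions[key] = len(dictionary)
            dictionary.append(key)
        indices.append(positions[key])
    return indices, dictionary


def nest_by_column(rows: List[FlatRow]) -> Dict[Optional[str], List[str]]:
    """Question 2: group keys under their column so `column` is not repeated."""
    nested: Dict[Optional[str], List[str]] = {}
    for column, key in rows:
        nested.setdefault(column, []).append(key)
    return nested


rows: List[FlatRow] = [("a", "min"), ("a", "max"), ("b", "min"), ("b", "max")]
indices, dictionary = dictionary_encode([key for _, key in rows])
nested = nest_by_column(rows)
```

With many columns, the flat layout repeats both the column name
and the key string on every row; the dictionary stores each key
string once, and the nested form stores each column name once.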
