I brought this up on GitHub, but I'm writing here too to avoid spawning too
many threads.
https://github.com/apache/arrow/issues/38837#issuecomment-2145343755

It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add per-batch
statistics to ArrowArrayStream.

While it's true that in most cases the producer code will be applying the
filtering, with the C Data Interface we can't take that for granted. There
might be cases where the consumer has no control over the filtering that
the producer applies, and the producer might not be aware of the
filtering that the consumer wants to do.

In those cases providing the statistics per-batch would allow the consumer
to skip the batches it doesn't care about, thus giving the opportunity for
a fast path.
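
To illustrate the kind of fast path I mean, here is a rough sketch in plain
Python. It does not use any real Arrow API: `Batch`, `BatchStats`, and
`scan_greater_than` are hypothetical names, and the min/max fields stand in for
whatever per-batch statistics we might eventually standardize. The consumer
skips a batch only when the statistics prove that no row can match, and it
falls back to scanning when a batch carries no statistics.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class BatchStats:
    """Hypothetical per-batch statistics for a single integer column."""
    min_value: Optional[int]
    max_value: Optional[int]


@dataclass
class Batch:
    """Hypothetical stand-in for one batch coming out of a stream."""
    values: List[int]
    stats: Optional[BatchStats]  # None when the producer sent no statistics


def may_contain_greater_than(stats: Optional[BatchStats], threshold: int) -> bool:
    """Return False only when the statistics rule out any value > threshold."""
    if stats is None or stats.max_value is None:
        # No usable statistics: the batch has to be scanned.
        return True
    return stats.max_value > threshold


def scan_greater_than(stream: Iterable[Batch], threshold: int) -> Iterator[int]:
    """Yield values > threshold, skipping batches that cannot match."""
    for batch in stream:
        if not may_contain_greater_than(batch.stats, threshold):
            continue  # fast path: the batch is never inspected row by row
        for value in batch.values:
            if value > threshold:
                yield value


batches = [
    Batch([1, 2, 3], BatchStats(1, 3)),      # skipped for threshold 4
    Batch([10, 20], BatchStats(10, 20)),     # scanned
    Batch([5, 50], None),                    # no statistics, so scanned
]
matching = list(scan_greater_than(batches, 4))
```

The producer never needs to know about the `> 4` predicate here; it only has
to attach the statistics to each batch.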

On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Kou,
>
> Thanks for pushing for this!
>
> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
> > 4. Standardize Apache Arrow schema for statistics and
> >     transmit statistics via separated API call that uses the
> >     C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > ----
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|--------- |
> > | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name           | Type                  | Comments |
> > |----------------|-----------------------| -------- |
> > | column         | utf8                  | (2)      |
> > | key            | utf8 not null         | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
>
