Hi,

We can use 4. for per-batch statistics because 4. uses a
separate API call. Users can design that separate API call
so that it returns per-batch statistics.
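As a rough illustration only (this is not an Arrow API), here is a small pure-Python sketch of the idea: the producer exposes data and statistics through separate calls, and the consumer reads the per-batch statistics first so that it can skip batches it does not need. The names Producer, get_batch, get_batch_statistics, and BatchStatistics, as well as the min/max fields, are all hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class BatchStatistics:
    # Hypothetical per-batch record; the field names are illustrative only.
    batch_index: int
    column: str
    min_value: int
    max_value: int


class Producer:
    """Hypothetical producer exposing data and statistics via separate calls."""

    def __init__(self, batches: list[list[int]]) -> None:
        self._batches = batches

    def get_batch(self, index: int) -> list[int]:
        # Stands in for pulling one batch from an ArrowArrayStream.
        return self._batches[index]

    def get_batch_statistics(self) -> Iterator[BatchStatistics]:
        # The separate call: one statistics record per batch of column "x".
        for i, values in enumerate(self._batches):
            yield BatchStatistics(i, "x", min(values), max(values))


def scan_greater_than(producer: Producer, threshold: int):
    """Return (matching values, indices of skipped batches) for x > threshold."""
    matches, skipped = [], []
    for stats in producer.get_batch_statistics():
        if stats.max_value <= threshold:
            # Fast path: no value in this batch can match, so never fetch it.
            skipped.append(stats.batch_index)
            continue
        batch = producer.get_batch(stats.batch_index)
        matches.extend(v for v in batch if v > threshold)
    return matches, skipped


producer = Producer([[1, 2, 3], [10, 20, 30], [4, 5, 6]])
matches, skipped = scan_greater_than(producer, 5)
print(matches, skipped)
```

In this sketch the consumer decides what to skip, so the producer does not need to know which filter the consumer wants to apply.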
Thanks,
-- 
kou

In <CAH=7pqywdtkrfnrxk_dakbgrrmwas6c3d7-u_33mc7dfsu4...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 13:14:08 +0200,
  Alessandro Molina <alessan...@voltrondata.com.INVALID> wrote:

> I brought it up on GitHub, but writing here too to avoid spawning too many
> threads.
> https://github.com/apache/arrow/issues/38837#issuecomment-2145343755
>
> It's not something we have to address now, but it would be great if we
> could design a solution that can be extended in the future to add per-batch
> statistics in ArrowArrayStream.
>
> While it's true that in most cases the producer code will be applying the
> filtering, in the case of C-Data we can't take that for granted. There
> might be cases where the consumer has no control over the filtering that
> the producer would apply and the producer might not be aware of the
> filtering that the consumer might want to do.
>
> In those cases providing the statistics per-batch would allow the consumer
> to skip the batches it doesn't care about, thus giving the opportunity for
> a fast path.
>
> On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hi Kou,
>>
>> Thanks for pushing for this!
>>
>> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
>> > 4. Standardize Apache Arrow schema for statistics and
>> >    transmit statistics via separated API call that uses the
>> >    C data interface
>> [...]
>> >
>> > I think that 4. is the best approach in these candidates.
>>
>> I agree.
>>
>> > If we select 4., we need to standardize Apache Arrow schema
>> > for statistics. How about the following schema?
>> >
>> > ----
>> > Metadata:
>> >
>> > | Name                       | Value | Comments |
>> > |----------------------------|-------|----------|
>> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>
>> I'm not sure this is useful, but it doesn't hurt.
>>
>> Nit: this should be "ARROW:statistics:version" for consistency with
>> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>>
>> > Fields:
>> >
>> > | Name           | Type                  | Comments |
>> > |----------------|-----------------------|----------|
>> > | column         | utf8                  | (2)      |
>> > | key            | utf8 not null         | (3)      |
>>
>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>>    the representation more efficient where there are many columns?
>>
>> 2. Should the statistics perhaps be nested as a map type under each
>>    column to avoid repeating `column`, or is that overkill?
>>
>> 3. Should there also be room for multi-column statistics (such as
>>    cardinality of a given column pair), or is it too complex for now?
>>
>> Regards
>>
>> Antoine.
>>
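
To make questions 1. and 2. above a little more concrete, here is a pure-Python sketch (it does not use Arrow) of what dictionary-encoding the key and nesting statistics under each column would change in the layout. The key names "min" and "max", the value field, and the helper names dictionary_encode and nest_by_column are assumptions made for this illustration; they are not part of the proposed schema.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) so that each distinct string is stored once."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def nest_by_column(rows):
    """Group flat (column, key, value) rows into {column: {key: value}}."""
    nested = {}
    for column, key, value in rows:
        nested.setdefault(column, {})[key] = value
    return nested


# Flat layout: one row per statistic, so `column` and `key` strings repeat.
# The keys and values here are made up for illustration.
rows = [
    ("x", "min", 1),
    ("x", "max", 9),
    ("y", "min", -3),
    ("y", "max", 4),
]

# Question 1: store each distinct key once and refer to it by integer index.
key_dictionary, key_indices = dictionary_encode([key for _, key, _ in rows])

# Question 2: nest statistics under each column so `column` is not repeated.
nested = nest_by_column(rows)

print(key_dictionary, key_indices)
print(nested)
```

With many columns sharing the same few keys, the dictionary stays small while the indices grow, which is the efficiency argument behind question 1.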