Hi,

I agree that the proposed approach is a departure from the existing use
of ArrowSchema.
ADBC may be a bit too large to use only for transmitting statistics.
ADBC has statistics-related APIs, but it also has many other APIs.

> It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?

This may be a candidate approach. Or we may want to add an
ArrowArrayStream::get_statistics() callback, but that breaks the ABI...

How about only defining a schema for a statistics ArrowArray? It
doesn't define how to get the statistics ArrowArray the way the Arrow C
data interface does, but defining a schema for the statistics
ArrowArray would still improve how statistics are transmitted.

Thanks,
--
kou

In <CAFb7qScyETeHxAo+uHcrqCYe=7qcmkseq+zurzm5ckcjjvu...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:38:13 -0300,
  Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> I am definitely in favor of adding (or adopting an existing)
> ABI-stable way to transmit statistics (the one that comes up most
> frequently for me is just the number of values that are about to show
> up in an ArrowArrayStream, since the producer often knows this and the
> consumer often would like to preallocate).
>
> I am skeptical of using the existing C ArrowSchema ABI to do this. The
> ArrowSchema is exceptionally good at representing Arrow data types
> (which, in the presence of dictionaries, nested types, and extensions,
> is difficult to do); however, using it to handle all aspects of a
> consumer request/producer response, I think, dilutes its ability to do
> this well.
>
> If I'm understanding the proposal (and I may not be), the ArrowSchema
> will be used to encode data-dependent values, which means the same
> ArrowSchema is very tightly paired to a particular array stream (or
> array). This means that one could no longer (e.g.) consume an array
> stream and blindly assign each array in the stream the schema that was
> returned by get_schema(). This is not impossible to work around, but it
> is a conceptual departure from the role the ArrowSchema has had in the
> past. Encoding pointers as strings in metadata is also a departure
> from what we have done previously.
>
> It is possible to condense the boilerplate of an ADBC driver to about
> 10 lines of code [1]. Is there a reason we can't use ADBC (or an
> extension to that standard) to more precisely handle those types of
> requests/responses (and extensions to them that come up in the
> future)? It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?
>
> [1] https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66
>
> On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>>
>> Hi,
>>
>> One potential challenge with encoding statistics in the schema metadata
>> is that some systems may consider this metadata as part of assessing
>> schema equivalence.
>>
>> However, I think the bigger question is what the intended use case for
>> these statistics is. Often query engines want to collect statistics from
>> multiple containers in one go, as this allows for efficient vectorised
>> pruning across multiple files, row groups, etc. I therefore wonder if
>> the solution is simply to return separate arrays of min, max, etc.,
>> potentially even grouped together into a single StructArray?
>>
>> This would have the benefit of not needing specification changes, whilst
>> being significantly more efficient than an approach centered on scalar
>> statistics.
>> FWIW this is the approach taken by DataFusion for pruning
>> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
>> needing to define a parallel serialization standard [2].
>>
>> Kind Regards,
>>
>> Raphael
>>
>> [1]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
>> [2]: https://github.com/apache/arrow-rs/pull/4393
>>
>> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> > Hi,
>> >
>> > We're discussing how to provide statistics through the C
>> > data interface at:
>> > https://github.com/apache/arrow/issues/38837
>> >
>> > If you're interested in this feature, could you share your
>> > comments?
>> >
>> >
>> > Motivation:
>> >
>> > We can interchange Apache Arrow data through the C data interface
>> > in the same process. For example, we can pass Apache Arrow
>> > data read by Apache Arrow C++ (producer) to DuckDB
>> > (consumer) through the C data interface.
>> >
>> > A producer may know statistics about the Apache Arrow data. For
>> > example, a producer can know statistics when it reads Apache
>> > Parquet data, because Apache Parquet may provide statistics.
>> >
>> > But a consumer can't know the statistics that are known by a
>> > producer, because there isn't a standard way to provide
>> > statistics through the C data interface. If a consumer could
>> > access statistics, it could process Apache Arrow data faster
>> > based on them.
>> >
>> >
>> > Proposal:
>> >
>> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> >
>> > How about providing statistics as metadata in ArrowSchema?
>> >
>> > We reserve the "ARROW" namespace for internal Apache Arrow use:
>> >
>> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >
>> >> The ARROW pattern is a reserved namespace for internal
>> >> Arrow use in the custom_metadata fields. For example,
>> >> ARROW:extension:name.
>> >
>> > So we can use "ARROW:statistics" for the metadata key.
>> >
>> > We can represent statistics as an ArrowArray, like ADBC does.
>> >
>> > Here is an example ArrowSchema for a record batch
>> > that has "int32 column1" and "string column2":
>> >
>> > ArrowSchema {
>> >   .format = "+s",
>> >   .metadata = {
>> >     "ARROW:statistics" => ArrowArray*, /* table-level statistics
>> >                                           such as row count */
>> >   },
>> >   .children = {
>> >     ArrowSchema {
>> >       .name = "column1",
>> >       .format = "i",
>> >       .metadata = {
>> >         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>> >                                               such as count distinct */
>> >       },
>> >     },
>> >     ArrowSchema {
>> >       .name = "column2",
>> >       .format = "u",
>> >       .metadata = {
>> >         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>> >                                               such as count distinct */
>> >       },
>> >     },
>> >   },
>> > }
>> >
>> > The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > => ArrowArray*' is a base-10 string of the address of the
>> > ArrowArray, because we can only use strings for metadata
>> > values. You can't release the statistics ArrowArray*. (Its
>> > release is a no-op function.) It follows the
>> > https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > semantics. (The base ArrowSchema owns the statistics
>> > ArrowArray*.)
>> >
>> >
>> > The ArrowArray* for statistics uses the following schema:
>> >
>> > | Field Name     | Field Type              | Comments |
>> > |----------------|-------------------------|----------|
>> > | key            | string not null         | (1)      |
>> > | value          | `VALUE_SCHEMA` not null |          |
>> > | is_approximate | bool not null           | (2)      |
>> >
>> > 1. We'll provide pre-defined keys such as "max", "min",
>> >    "byte_width" and "distinct_count", but users can also use
>> >    application-specific keys.
>> >
>> > 2. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type                       | Comments |
>> > |------------|----------------------------------|----------|
>> > | int64      | int64                            |          |
>> > | uint64     | uint64                           |          |
>> > | float64    | float64                          |          |
>> > | value      | The same type as the ArrowSchema | (3)      |
>> > |            | that it belongs to.              |          |
>> >
>> > 3. If the ArrowSchema's type is string, this type is also string.
>> >
>> > TODO: Is "value" a good name? If we refer to it from the
>> > top-level statistics schema, we need to use
>> > "value.value". It's a bit strange...
>> >
>> >
>> > What do you think about this proposal? Could you share your
>> > comments?
>> >
>> >
>> > Thanks,