I am definitely in favor of adding (or adopting an existing) ABI-stable way to
transmit statistics. The one that comes up most frequently for me is just the
number of values that are about to show up in an ArrowArrayStream, since the
producer often knows this and the consumer would often like to preallocate.
I am skeptical of using the existing C ArrowSchema ABI to do this. The
ArrowSchema is exceptionally good at representing Arrow data types (which, in
the presence of dictionaries, nested types, and extensions, is difficult to
do); however, I think using it to handle all aspects of a consumer
request/producer response dilutes its ability to do that well.

If I'm understanding the proposal (and I may not be), the ArrowSchema will be
used to encode data-dependent values, which means the same ArrowSchema is very
tightly paired to a particular array stream (or array). One could then no
longer (e.g.) consume an array stream and blindly assign each array in the
stream the schema that was returned by get_schema(). This is not impossible to
work around, but it is a conceptual departure from the role the ArrowSchema
has had in the past. Encoding pointers as strings in metadata is also a
departure from what we have done previously.

It is possible to condense the boilerplate of an ADBC driver to about 10 lines
of code [1]. Is there a reason we can't use ADBC (or an extension to that
standard) to more precisely handle these kinds of requests/responses (and
extensions to them that come up in the future)?

This is also not the first time the need to encode data-dependent information
in a schema has come up (e.g., encoding scalar vs. record batch semantics), so
perhaps there is a need for another type of array stream or descriptor struct?

[1] https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66

On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
>
> Hi,
>
> One potential challenge with encoding statistics in the schema metadata
> is that some systems may consider this metadata as part of assessing
> schema equivalence.
>
> However, I think the bigger question is what the intended use-case for
> these statistics is?
> Often query engines want to collect statistics from
> multiple containers in one go, as this allows for efficient vectorised
> pruning across multiple files, row groups, etc... I therefore wonder if
> the solution is simply to return separate arrays of min, max, etc...
> potentially even grouped together into a single StructArray?
>
> This would have the benefit of not needing specification changes, whilst
> being significantly more efficient than an approach centered on scalar
> statistics. FWIW this is the approach taken by DataFusion for pruning
> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
> needing to define a parallel serialization standard [2].
>
> Kind Regards,
>
> Raphael
>
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
>
> On 22/05/2024 03:37, Sutou Kouhei wrote:
> > Hi,
> >
> > We're discussing how to provide statistics through the C
> > data interface at:
> > https://github.com/apache/arrow/issues/38837
> >
> > If you're interested in this feature, could you share your
> > comments?
> >
> >
> > Motivation:
> >
> > We can interchange Apache Arrow data by the C data interface
> > in the same process. For example, we can pass Apache Arrow
> > data read by Apache Arrow C++ (provider) to DuckDB
> > (consumer) through the C data interface.
> >
> > A provider may know Apache Arrow data statistics. For
> > example, a provider can know statistics when it reads Apache
> > Parquet data because Apache Parquet may provide statistics.
> >
> > But a consumer can't know statistics that are known by a
> > producer, because there isn't a standard way to provide
> > statistics through the C data interface. If a consumer can
> > know statistics, it can process Apache Arrow data faster
> > based on statistics.
> >
> >
> > Proposal:
> >
> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >
> > How about providing statistics as metadata in ArrowSchema?
> >
> > We reserve the "ARROW" namespace for internal Apache Arrow use:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> >> The ARROW pattern is a reserved namespace for internal
> >> Arrow use in the custom_metadata fields. For example,
> >> ARROW:extension:name.
> >
> > So we can use "ARROW:statistics" for the metadata key.
> >
> > We can represent statistics as an ArrowArray like ADBC does.
> >
> > Here is an example ArrowSchema that is for a record batch
> > that has "int32 column1" and "string column2":
> >
> > ArrowSchema {
> >   .format = "+s",
> >   .metadata = {
> >     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
> >   },
> >   .children = {
> >     ArrowSchema {
> >       .name = "column1",
> >       .format = "i",
> >       .metadata = {
> >         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >       },
> >     },
> >     ArrowSchema {
> >       .name = "column2",
> >       .format = "u",
> >       .metadata = {
> >         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >       },
> >     },
> >   },
> > }
> >
> > The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> > => ArrowArray*' is a base-10 string of the address of the
> > ArrowArray, because only strings can be used for metadata
> > values. You can't release the statistics ArrowArray*. (Its
> > release is a no-op function.) It follows
> > https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> > semantics. (The base ArrowSchema owns the statistics
> > ArrowArray*.)
> >
> >
> > The ArrowArray* for statistics uses the following schema:
> >
> > | Field Name     | Field Type              | Comments |
> > |----------------|-------------------------|----------|
> > | key            | string not null         | (1)      |
> > | value          | `VALUE_SCHEMA` not null |          |
> > | is_approximate | bool not null           | (2)      |
> >
> > 1. We'll provide pre-defined keys such as "max", "min",
> >    "byte_width" and "distinct_count", but users can also use
> >    application-specific keys.
> >
> > 2. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type                       | Comments |
> > |------------|----------------------------------|----------|
> > | int64      | int64                            |          |
> > | uint64     | uint64                           |          |
> > | float64    | float64                          |          |
> > | value      | The same type as the ArrowSchema | (3)      |
> > |            | that it belongs to.              |          |
> >
> > 3. If the ArrowSchema's type is string, this type is also string.
> >
> > TODO: Is "value" a good name? If we refer to it from the
> > top-level statistics schema, we need to use
> > "value.value". It's a bit strange...
> >
> >
> > What do you think about this proposal? Could you share your
> > comments?
> >
> >
> > Thanks,