Hi, We're discussing how to provide statistics through the C data interface at: https://github.com/apache/arrow/issues/38837
If you're interested in this feature, could you share your comments? Motivation: We can interchange Apache Arrow data by the C data interface in the same process. For example, we can pass Apache Arrow data read by Apache Arrow C++ (provider) to DuckDB (consumer) through the C data interface. A provider may know Apache Arrow data statistics. For example, a provider can know statistics when it reads Apache Parquet data because Apache Parquet may provide statistics. But a consumer can't know statistics that are known by a producer. Because there isn't a standard way to provide statistics through the C data interface. If a consumer can know statistics, it can process Apache Arrow data faster based on statistics. Proposal: https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 How about providing statistics as a metadata in ArrowSchema? We reserve "ARROW" namespace for internal Apache Arrow use: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > The ARROW pattern is a reserved namespace for internal > Arrow use in the custom_metadata fields. For example, > ARROW:extension:name. So we can use "ARROW:statistics" for the metadata key. We can represent statistics as a ArrowArray like ADBC does. Here is an example ArrowSchema that is for a record batch that has "int32 column1" and "string column2": ArrowSchema { .format = "+siu", .metadata = { "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */ }, .children = { ArrowSchema { .name = "column1", .format = "i", .metadata = { "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */ }, }, ArrowSchema { .name = "column2", .format = "u", .metadata = { "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */ }, }, }, } The metadata value (ArrowArray* part) of '"ARROW:statistics" => ArrowArray*' is a base 10 string of the address of the ArrowArray. Because we can use only string for metadata value. You can't release the statistics ArrowArray*. (Its release is a no-op function.) It follows https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation semantics. (The base ArrowSchema owns statistics ArrowArray*.) ArrowArray* for statistics use the following schema: | Field Name | Field Type | Comments | |----------------|----------------------------------| -------- | | key | string not null | (1) | | value | `VALUE_SCHEMA` not null | | | is_approximate | bool not null | (2) | 1. We'll provide pre-defined keys such as "max", "min", "byte_width" and "distinct_count" but users can also use application specific keys. 2. If true, then the value is approximate or best-effort. VALUE_SCHEMA is a dense union with members: | Field Name | Field Type | Comments | |------------|----------------------------------| -------- | | int64 | int64 | | | uint64 | uint64 | | | float64 | float64 | | | value | The same type of the ArrowSchema | (3) | | | that is belonged to. | | 3. If the ArrowSchema's type is string, this type is also string. TODO: Is "value" good name? If we refer it from the top-level statistics schema, we need to use "value.value". It's a bit strange... What do you think about this proposal? Could you share your comments? Thanks, -- kou