Hi,

> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice

It seems that we should use this approach because all the
feedback so far recommends it. How about the following
schema for the statistics ArrowArray? It's based on ADBC.

| Field Name               | Field Type            | Comments |
|--------------------------|-----------------------|----------|
| column_name              | utf8                  | (1)      |
| statistic_key            | utf8 not null         | (2)      |
| statistic_value          | VALUE_SCHEMA not null |          |
| statistic_is_approximate | bool not null         | (3)      |

1. If null, then the statistic applies to the entire table.
   This is intended for table-level statistics such as
   "row_count".
2. We'll provide pre-defined keys such as "max", "min",
   "byte_width" and "distinct_count" but users can also use
   application specific keys.
3. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type |
|------------|------------|
| int64      | int64      |
| uint64     | uint64     |
| float64    | float64    |
| binary     | binary     |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for an int32 column) instead.

Thanks,
--
kou

In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Kou,
>
> I agree with Dewey that this is overstretching the capabilities of the
> C Data Interface. In particular, stuffing a pointer as metadata value
> and decreeing it immortal doesn't sound like a good design decision.
>
> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice (Dewey mentioned ADBC but it is of course just
> a possible API among others)?
>
> Regards
>
> Antoine.
>
>
> On 22/05/2024 at 04:37, Sutou Kouhei wrote:
>> Hi,
>>
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>>
>> If you're interested in this feature, could you share your
>> comments?
>>
>> Motivation:
>>
>> We can interchange Apache Arrow data by the C data interface
>> in the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (provider) to DuckDB
>> (consumer) through the C data interface.
>>
>> A provider may know Apache Arrow data statistics. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data because Apache Parquet may provide statistics.
>>
>> But a consumer can't know statistics that are known by a
>> producer, because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer can
>> know statistics, it can process Apache Arrow data faster
>> based on statistics.
>>
>> Proposal:
>>
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>
>> How about providing statistics as metadata in ArrowSchema?
>>
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>>
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>>
>> So we can use "ARROW:statistics" for the metadata key.
>>
>> We can represent statistics as an ArrowArray like ADBC does.
>>
>> Here is an example ArrowSchema that is for a record batch
>> that has "int32 column1" and "string column2":
>>
>> ArrowSchema {
>>   .format = "+siu",
>>   .metadata = {
>>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>   },
>>   .children = {
>>     ArrowSchema {
>>       .name = "column1",
>>       .format = "i",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>       },
>>     },
>>     ArrowSchema {
>>       .name = "column2",
>>       .format = "u",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>       },
>>     },
>>   },
>> }
>>
>> The metadata value (ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is a base 10 string of the address of the
>> ArrowArray, because we can use only strings for metadata
>> values. You can't release the statistics ArrowArray*. (Its
>> release is a no-op function.) It follows
>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> semantics. (The base ArrowSchema owns the statistics
>> ArrowArray*.)
>>
>> ArrowArray* for statistics uses the following schema:
>>
>> | Field Name     | Field Type              | Comments |
>> |----------------|-------------------------|----------|
>> | key            | string not null         | (1)      |
>> | value          | `VALUE_SCHEMA` not null |          |
>> | is_approximate | bool not null           | (2)      |
>>
>> 1. We'll provide pre-defined keys such as "max", "min",
>>    "byte_width" and "distinct_count" but users can also use
>>    application specific keys.
>> 2. If true, then the value is approximate or best-effort.
>>
>> VALUE_SCHEMA is a dense union with members:
>>
>> | Field Name | Field Type                                      | Comments |
>> |------------|-------------------------------------------------|----------|
>> | int64      | int64                                           |          |
>> | uint64     | uint64                                          |          |
>> | float64    | float64                                         |          |
>> | value      | The same type as the ArrowSchema it belongs to. | (3)      |
>>
>> 3. If the ArrowSchema's type is string, this type is also string.
>>
>> TODO: Is "value" a good name? If we refer to it from the
>> top-level statistics schema, we need to use
>> "value.value". It's a bit strange...
>>
>> What do you think about this proposal? Could you share your
>> comments?
>>
>> Thanks,
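
For illustration only, here is a minimal sketch of how rows of the standalone statistics array proposed at the top of this message (column_name, statistic_key, statistic_value, statistic_is_approximate) could be modeled. It uses plain Python and no Arrow library. The names `StatisticRow`, `union_member_for` and `make_row` are hypothetical, and the rule for choosing a VALUE_SCHEMA member (int64 first, uint64 only when the value does not fit in int64) is an assumption, not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Optional, Union

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


@dataclass(frozen=True)
class StatisticRow:
    """One row of the proposed statistics array (hypothetical model)."""
    column_name: Optional[str]            # None means a table-level statistic
    statistic_key: str                    # e.g. "max", "min", "row_count"
    value_member: str                     # dense-union member holding the value
    statistic_value: Union[int, float, bytes]
    statistic_is_approximate: bool


def union_member_for(value):
    """Pick a VALUE_SCHEMA member for a value (assumed selection rule)."""
    if isinstance(value, bool):
        raise TypeError("VALUE_SCHEMA has no bool member")
    if isinstance(value, int):
        if INT64_MIN <= value <= INT64_MAX:
            return "int64"                # also used for int32 columns
        if INT64_MAX < value <= UINT64_MAX:
            return "uint64"
        raise OverflowError("integer does not fit in int64 or uint64")
    if isinstance(value, float):
        return "float64"
    if isinstance(value, (bytes, bytearray)):
        return "binary"
    raise TypeError(f"no VALUE_SCHEMA member for {type(value).__name__}")


def make_row(column_name, key, value, approximate=False):
    """Build a row, recording which union member carries the value."""
    return StatisticRow(column_name, key, union_member_for(value), value, approximate)


# Statistics for a table with "int32 column1" and "string column2".
rows = [
    make_row(None, "row_count", 1000),                        # table-level
    make_row("column1", "max", 42),                           # int32 stored as int64
    make_row("column1", "min", -7),
    make_row("column2", "distinct_count", 120, approximate=True),
    make_row("column2", "max", "zebra".encode("utf-8")),      # string stored as binary
]
```

A null column_name marks the table-level row, and an int32 column's "max"/"min" land in the int64 member, as described above. Storing a string column's "max" as UTF-8 bytes in the binary member is likewise an assumption of this sketch.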