Hi, Here is an updated summary so far:
---- Use cases: * Optimize query plan: e.g. JOIN for DuckDB Out of scope: * Transmit statistics through not the C data interface Examples: * Transmit statistics through Apache Arrow IPC file * Transmit statistics through Apache Arrow Flight * Multi-column statistics * Constraints information * Indexes information Discussing approach: Standardize Apache Arrow schema for statistics and transmit statistics via separated API call that uses the C data interface. This also works for per-batch statistics. Candidate schema: map< // The column index or null if the statistics refer to whole table or batch. column: int32, // Statistics key is int32. // Different keys are assigned for exact value and // approximate value. map<int32, dense_union<...needed types based on stat kinds in the keys...>> > Discussions: 1. Can we use int32 for statistic keys? Should we use utf8 (or dictionary<int32, utf8>) for statistic keys? 2. Hot to support non-standard (vendor-specific) statistic keys? ---- Here is my idea: 1. We can use int32 for statistic keys. 2. We can reserve a specific range for non-standard statistic keys. Prerequisites of this: * There is no use case to merge some statistics for the same data. * We can't merge statistics for different data. If the prerequisites aren't satisfied: 1. We should use utf8 (or dictionary<int32, utf8>) for statistic keys? 2. We can use reserved prefix such as "ARROW:"/"arrow." for standard statistic keys or use prefix such as "vendor1:"/"vendor1." for non-standard statistic keys. Here is Felipe's idea: https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 1. We can use int32 for statistic keys. 2. We can use the special statistic key + a string identifier for non-standard statistic keys. What do you think about this? Thanks, -- kou In <20240606.182727.1004633558059795207....@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > Thanks for sharing your comments. Here is a summary so far: > > ---- > > Use cases: > > * Optimize query plan: e.g. JOIN for DuckDB > > Out of scope: > > * Transmit statistics through not the C data interface > Examples: > * Transmit statistics through Apache Arrow IPC file > * Transmit statistics through Apache Arrow Flight > > Candidate approaches: > > 1. Pass statistics (encoded as an Apache Arrow data) via > ArrowSchema metadata > * This embeds statistics address into metadata > * It's for avoiding using Apache Arrow IPC format with > the C data interface > 2. Embed statistics (encoded as an Apache Arrow data) into > ArrowSchema metadata > * This adds statistics to metadata in Apache Arrow IPC > format > 3. Embed statistics (encoded as JSON) into ArrowArray > metadata > 4. Standardize Apache Arrow schema for statistics and > transmit statistics via separated API call that uses the > C data interface > 5. Use ADBC > > ---- > > I think that 4. is the best approach in these candidates. > > 1. Embedding statistics address is tricky. > 2. Consumers need to parse Apache Arrow IPC format data. > (The C data interface consumers may not have the > feature.) > 3. This will work but 4. is more generic. > 5. ADBC is too large to use only for statistics. > > What do you think about this? > > > If we select 4., we need to standardize Apache Arrow schema > for statistics. How about the following schema? > > ---- > Metadata: > > | Name | Value | Comments | > |----------------------------|-------|--------- | > | ARROW::statistics::version | 1.0.0 | (1) | > > (1) This follows semantic versioning. > > Fields: > > | Name | Type | Comments | > |----------------|-----------------------| -------- | > | column | utf8 | (2) | > | key | utf8 not null | (3) | > | value | VALUE_SCHEMA not null | | > | is_approximate | bool not null | (4) | > > (2) If null, then the statistic applies to the entire table. > It's for "row_count". > (3) We'll provide pre-defined keys such as "max", "min", > "byte_width" and "distinct_count" but users can also use > application specific keys. > (4) If true, then the value is approximate or best-effort. > > VALUE_SCHEMA is a dense union with members: > > | Name | Type | > |---------|---------| > | int64 | int64 | > | uint64 | uint64 | > | float64 | float64 | > | binary | binary | > > If a column is an int32 column, it uses int64 for > "max"/"min". We don't provide all types here. Users should > use a compatible type (int64 for a int32 column) instead. > ---- > > > Thanks, > -- > kou > > > In <20240522.113708.2023905028549001143....@clear-code.com> > "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 > 11:37:08 +0900 (JST), > Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> We're discussing how to provide statistics through the C >> data interface at: >> https://github.com/apache/arrow/issues/38837 >> >> If you're interested in this feature, could you share your >> comments? >> >> >> Motivation: >> >> We can interchange Apache Arrow data by the C data interface >> in the same process. For example, we can pass Apache Arrow >> data read by Apache Arrow C++ (provider) to DuckDB >> (consumer) through the C data interface. >> >> A provider may know Apache Arrow data statistics. For >> example, a provider can know statistics when it reads Apache >> Parquet data because Apache Parquet may provide statistics. >> >> But a consumer can't know statistics that are known by a >> producer. Because there isn't a standard way to provide >> statistics through the C data interface. If a consumer can >> know statistics, it can process Apache Arrow data faster >> based on statistics. >> >> >> Proposal: >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 >> >> How about providing statistics as a metadata in ArrowSchema? >> >> We reserve "ARROW" namespace for internal Apache Arrow use: >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata >> >>> The ARROW pattern is a reserved namespace for internal >>> Arrow use in the custom_metadata fields. For example, >>> ARROW:extension:name. >> >> So we can use "ARROW:statistics" for the metadata key. >> >> We can represent statistics as a ArrowArray like ADBC does. >> >> Here is an example ArrowSchema that is for a record batch >> that has "int32 column1" and "string column2": >> >> ArrowSchema { >> .format = "+siu", >> .metadata = { >> "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row >> count */ >> }, >> .children = { >> ArrowSchema { >> .name = "column1", >> .format = "i", >> .metadata = { >> "ARROW:statistics" => ArrowArray*, /* column-level statistics such >> as count distinct */ >> }, >> }, >> ArrowSchema { >> .name = "column2", >> .format = "u", >> .metadata = { >> "ARROW:statistics" => ArrowArray*, /* column-level statistics such >> as count distinct */ >> }, >> }, >> }, >> } >> >> The metadata value (ArrowArray* part) of '"ARROW:statistics" >> => ArrowArray*' is a base 10 string of the address of the >> ArrowArray. Because we can use only string for metadata >> value. You can't release the statistics ArrowArray*. (Its >> release is a no-op function.) It follows >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation >> semantics. (The base ArrowSchema owns statistics >> ArrowArray*.) >> >> >> ArrowArray* for statistics use the following schema: >> >> | Field Name | Field Type | Comments | >> |----------------|----------------------------------| -------- | >> | key | string not null | (1) | >> | value | `VALUE_SCHEMA` not null | | >> | is_approximate | bool not null | (2) | >> >> 1. We'll provide pre-defined keys such as "max", "min", >> "byte_width" and "distinct_count" but users can also use >> application specific keys. >> >> 2. If true, then the value is approximate or best-effort. >> >> VALUE_SCHEMA is a dense union with members: >> >> | Field Name | Field Type | Comments | >> |------------|----------------------------------| -------- | >> | int64 | int64 | | >> | uint64 | uint64 | | >> | float64 | float64 | | >> | value | The same type of the ArrowSchema | (3) | >> | | that is belonged to. | | >> >> 3. If the ArrowSchema's type is string, this type is also string. >> >> TODO: Is "value" good name? If we refer it from the >> top-level statistics schema, we need to use >> "value.value". It's a bit strange... >> >> >> What do you think about this proposal? Could you share your >> comments? >> >> >> Thanks, >> -- >> kou