Hi,

> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.
It's a good point. I hadn't noticed it. The proposed approach makes
schemas differ because they embed the addresses of the statistics
ArrowArrays, not their contents.

> However, I think the bigger question is what the intended use-case for
> these statistics is?

Ah, sorry. I should have explained it in the initial e-mail. The
intended use-case is query planning. See the DuckDB use case in the
issue:
https://github.com/apache/arrow/issues/38837#issuecomment-2031914284

> I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?

Yes, it's a candidate approach. I also considered a similar approach
based on ADBC. It's a bit more abstract than separate arrays or a
grouped StructArray because it can represent not only pre-defined
statistics (min, max, etc.) but also application-specific statistics:
https://github.com/apache/arrow/issues/38837#issuecomment-2074371230

But we can't provide a cross-language API with that approach. Apache
Arrow C++ would provide arrow::RecordBatch::statistics() or something
similar, but Apache Arrow Rust would use a different API. So I
proposed the C data interface based approach: we would define only
the schema for a statistics record batch, using the ADBC-like
approach or the grouped StructArray approach (see the sketch below).
Is that sufficient for sharing statistics widely?

> FWIW this is the approach taken by DataFusion for
> pruning statistics [1],

Thanks for sharing this. It seems that DataFusion supports only
pre-defined statistics. Does DataFusion ever need
application-specific statistics? I'm not sure whether we should
support application-specific statistics or not.
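To make the "define only the schema" idea concrete, here is a rough
sketch, in the same informal ArrowSchema notation as the proposal
quoted below, of what an ADBC-like statistics record batch schema
could look like. One row per (column, statistic) pair; all field
names, formats and union type ids here are tentative, not part of
any existing spec:

  ArrowSchema {
    .format = "+s", /* struct: the statistics record batch */
    .children = {
      /* NULL column_name = table-level statistics such as row count */
      ArrowSchema { .name = "column_name",    .format = "u" }, /* utf8 */
      ArrowSchema { .name = "key",            .format = "u" }, /* e.g. "max" */
      ArrowSchema { .name = "value",          .format = "+ud:0,1,2" },
                                              /* dense union of
                                                 int64/uint64/float64 */
      ArrowSchema { .name = "is_approximate", .format = "b" }, /* bool */
    },
  }

If the field names and union members are standardized, a consumer
could filter this batch by column_name/key without any new C ABI.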
Thanks,
--
kou

In <fc7ae05d-f420-4c74-8c01-b77e9a256...@googlemail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 12:14:40 +0100,
  Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Hi,
>
> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.
>
> However, I think the bigger question is what the intended use-case for
> these statistics is? Often query engines want to collect statistics
> from multiple containers in one go, as this allows for efficient
> vectorised pruning across multiple files, row groups, etc... I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?
>
> This would have the benefit of not needing specification changes,
> whilst being significantly more efficient than an approach centered on
> scalar statistics. FWIW this is the approach taken by DataFusion for
> pruning statistics [1], and in arrow-rs we represent scalars as arrays
> to avoid needing to define a parallel serialization standard [2].
>
> Kind Regards,
>
> Raphael
>
> [1]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
>
> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> Hi,
>>
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>>
>> If you're interested in this feature, could you share your
>> comments?
>>
>>
>> Motivation:
>>
>> We can interchange Apache Arrow data through the C data
>> interface in the same process. For example, we can pass
>> Apache Arrow data read by Apache Arrow C++ (provider) to
>> DuckDB (consumer) through the C data interface.
>>
>> A provider may know statistics of its Apache Arrow data. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data, because Apache Parquet may provide statistics.
>>
>> But a consumer can't know the statistics that are known by a
>> provider, because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer could
>> know the statistics, it could process the Apache Arrow data
>> faster based on them.
>>
>>
>> Proposal:
>>
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>
>> How about providing statistics as metadata in ArrowSchema?
>>
>> We reserve the "ARROW" namespace for internal Apache Arrow use:
>>
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>>
>> So we can use "ARROW:statistics" as the metadata key.
>>
>> We can represent statistics as an ArrowArray like ADBC does.
>>
>> Here is an example ArrowSchema for a record batch that has
>> "int32 column1" and "string column2":
>>
>> ArrowSchema {
>>   .format = "+s",
>>   .metadata = {
>>     "ARROW:statistics" => ArrowArray*, /* table-level statistics
>>                                           such as row count */
>>   },
>>   .children = {
>>     ArrowSchema {
>>       .name = "column1",
>>       .format = "i",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>>                                               such as count distinct */
>>       },
>>     },
>>     ArrowSchema {
>>       .name = "column2",
>>       .format = "u",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>>                                               such as count distinct */
>>       },
>>     },
>>   },
>> }
>>
>> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is a base 10 string of the address of the
>> ArrowArray, because we can use only strings for metadata
>> values. You can't release the statistics ArrowArray*. (Its
>> release is a no-op function.) It follows the
>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> semantics. (The base ArrowSchema owns the statistics
>> ArrowArray*.)
>>
>>
>> The ArrowArray* for statistics uses the following schema:
>>
>> | Field Name     | Field Type              | Comments |
>> |----------------|-------------------------|----------|
>> | key            | string not null         | (1)      |
>> | value          | `VALUE_SCHEMA` not null |          |
>> | is_approximate | bool not null           | (2)      |
>>
>> 1. We'll provide pre-defined keys such as "max", "min",
>>    "byte_width" and "distinct_count", but users can also use
>>    application-specific keys.
>>
>> 2. If true, the value is approximate or best-effort.
>>
>> VALUE_SCHEMA is a dense union with members:
>>
>> | Field Name | Field Type                       | Comments |
>> |------------|----------------------------------|----------|
>> | int64      | int64                            |          |
>> | uint64     | uint64                           |          |
>> | float64    | float64                          |          |
>> | value      | The same type as the ArrowSchema | (3)      |
>> |            | it belongs to.                   |          |
>>
>> 3. If the ArrowSchema's type is string, this type is also string.
>>
>> TODO: Is "value" a good name? If we refer to it from the
>> top-level statistics schema, we need to use
>> "value.value". That's a bit strange...
>>
>>
>> What do you think about this proposal? Could you share your
>> comments?
>>
>>
>> Thanks,
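P.S. For concreteness, here is a minimal, untested C sketch of how a
consumer could decode the proposed "ARROW:statistics" metadata
value. get_metadata() is a hypothetical helper that looks a key up
in an already parsed ArrowSchema::metadata buffer and returns the
value as a NUL-terminated string:

  #include <stdint.h>
  #include <stdlib.h>

  const char *encoded = get_metadata(schema, "ARROW:statistics");
  if (encoded != NULL) {
    /* The metadata value is the ArrowArray's address encoded as a
       base 10 string, so parse it back into a pointer. */
    uintptr_t address = (uintptr_t)strtoull(encoded, NULL, 10);
    const struct ArrowArray *statistics =
      (const struct ArrowArray *)address;
    /* The base ArrowSchema owns this array and its release callback
       is a no-op, so the consumer must not try to release it. */
  }

The producer side would be the reverse, e.g.
snprintf(buffer, sizeof(buffer), "%" PRIuPTR, (uintptr_t)statistics_array)
with a buffer of at least 21 bytes (a 64-bit address in base 10 plus
the NUL terminator), using <stdio.h> and <inttypes.h>.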
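And here is what the proposed per-column statistics schema could
look like for "int32 column1", in the same informal notation as
above. Note how the dense union's trailing "value" member mirrors
the column's own type; the type ids 0,1,2,3 are only an example:

  ArrowSchema {
    .format = "+s",
    .children = {
      ArrowSchema { .name = "key", .format = "u" }, /* utf8, not null */
      ArrowSchema {
        .name = "value",
        .format = "+ud:0,1,2,3", /* dense union */
        .children = {
          ArrowSchema { .name = "int64",   .format = "l" },
          ArrowSchema { .name = "uint64",  .format = "L" },
          ArrowSchema { .name = "float64", .format = "g" },
          ArrowSchema { .name = "value",   .format = "i" }, /* column1's
                                                               own type */
        },
      },
      ArrowSchema { .name = "is_approximate", .format = "b" }, /* bool */
    },
  }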