Re: [DISCUSS] Statistics through the C data interface
I've been thinking about how to encode statistics on Arrow arrays and how
to keep the set of statistics known by both producers and consumers
(i.e. standardized). The statistics array(s) could be a

  map<
    // the column index, or null if the statistics refer to the whole
    // table or batch
    column: int32,
    map<key: int32, value: dense_union<...>>
  >

The keys would be defined as part of the standard:

  // Statistics values are identified by specified int32-valued keys
  // so that producers and consumers can agree on physical
  // encoding and semantics. Statistics can be about a column,
  // a record batch, or both.
  typedef int32_t ArrowStatKind;

  #define ARROW_STAT_ANY 0
  // Exact number of nulls in a column. Value must be int32 or int64.
  #define ARROW_STAT_NULL_COUNT_EXACT 1
  // Approximate number of nulls in a column. Value must be float32 or float64.
  #define ARROW_STAT_NULL_COUNT_APPROX 2
  // The minimum and maximum values of a column.
  // Value must be the same type as the column.
  // Supported types are: ...
  #define ARROW_STAT_MIN_APPROX 3
  #define ARROW_STAT_MIN_NULLS_FIRST 4
  #define ARROW_STAT_MIN_NULLS_LAST 5
  #define ARROW_STAT_MAX_APPROX 6
  #define ARROW_STAT_MAX_NULLS_FIRST 7
  #define ARROW_STAT_MAX_NULLS_LAST 8
  #define ARROW_STAT_CARDINALITY_APPROX 9
  #define ARROW_STAT_COUNT_DISTINCT_APPROX 10

Every key is optional, and consumers that don't know or don't care about
a given stat can skip it while scanning the statistics arrays.
Applications would have their own domain classes for storing statistics
(e.g. DuckDB's BaseStatistics [1]) and a way to pack and unpack them into
these arrays. The exact types inside the dense_union would be chosen when
encoding; the decoder would handle the types expected and/or supported
for each given stat kind.

We wouldn't have to rely on versioning of the entire statistics objects.
If we want a richer way to represent a maximum, we add another stat kind
to the spec and keep producing both the old and the new representations
of the maximum while consumers migrate to the new way.
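To make the "consumers skip unknown kinds" idea concrete, here is a minimal sketch in plain C. The `StatEntry` struct and `find_stat` helper are hypothetical stand-ins for scanning the proposed map<int32, dense_union<...>> array; they are not part of the Arrow C data interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stat kinds as proposed above. */
typedef int32_t ArrowStatKind;
#define ARROW_STAT_NULL_COUNT_EXACT 1
#define ARROW_STAT_MIN_APPROX 3
#define ARROW_STAT_MAX_APPROX 6

/* Hypothetical flattened stat entry; the real proposal would use an
 * Arrow map array with dense_union values instead of this struct. */
typedef struct {
  ArrowStatKind kind;
  int64_t int64_value; /* only one value type shown for brevity */
} StatEntry;

/* Probe for a specific stat kind; returns 1 and fills *out if present.
 * Unknown kinds are simply skipped, so producers can add new kinds
 * without breaking old consumers. */
static int find_stat(const StatEntry *stats, size_t n,
                     ArrowStatKind kind, int64_t *out) {
  for (size_t i = 0; i < n; i++) {
    if (stats[i].kind == kind) {
      *out = stats[i].int64_value;
      return 1;
    }
  }
  return 0; /* this producer did not provide the stat */
}
```

A consumer that asks for `ARROW_STAT_MAX_APPROX` and gets 0 back simply plans without that statistic, which is the feature-probing behavior described above.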
Version markers in two-sided protocols never work well long term: see
Parquet files lying about the version of the encoder so the files can be
read, and web browsers lying in their User-Agent strings so websites
don't break. It's better to allow probing for individual feature support
(in this case, the presence of a specific stat kind in the array).
Multiple calls could be made to load statistics, and they could come
with more statistics each time.

--
Felipe

[1] https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington wrote:
>
> Thank you for collecting all of our opinions on this! I also agree
> that (4) is the best option.
>
> > Fields:
> >
> > | Name   | Type | Comments |
> > |--------|------|----------|
> > | column | utf8 | (2)      |
>
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).
>
> > If null, then the statistic applies to the entire table.
>
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
>
> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou wrote:
> >
> > Hi Kou,
> >
> > Thanks for pushing for this!
> >
> > Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > > 4. Standardize Apache Arrow schema for statistics and
> > >    transmit statistics via separated API call that uses the
> > >    C data interface
> > [...]
> > >
> > > I think that 4. is the best approach in these candidates.
> >
> > I agree.
> >
> > > If we select 4., we need to standardize Apache Arrow schema
> > > for statistics. How about the following schema?
> > >
> > > Metadata:
> > >
> > > | Name                       | Value | Comments |
> > > |----------------------------|-------|----------|
> > > | ARROW::statistics::version | 1.0.0 | (1)      |
> >
> > I'm not sure this is useful, but it doesn't hurt.
> >
> > Nit: this should be "ARROW:statistics:version" for consistency with
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > > Fields:
> > >
> > > | Name   | Type          | Comments |
> > > |--------|---------------|----------|
> > > | column | utf8          | (2)      |
> > > | key    | utf8 not null | (3)      |
> >
> > 1. Should the key be something like `dictionary(int32, utf8)` to make
> >    the representation more efficient where there are many columns?
> >
> > 2. Should the statistics perhaps be nested as a map type under each
> >    column to avoid repeating `column`, or is that overkill?
> >
> > 3. Should there also be room for multi-column statistics (such as
> >    cardinality of a given column pair), or is it too complex for now?
> >
> > Regards
> >
> > Antoine.
Re: [DISCUSS] Statistics through the C data interface
Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.

> Fields:
>
> | Name   | Type | Comments |
> |--------|------|----------|
> | column | utf8 | (2)      |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?

On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou wrote:
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> >    transmit statistics via separated API call that uses the
> >    C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|----------|
> > | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type          | Comments |
> > |--------|---------------|----------|
> > | column | utf8          | (2)      |
> > | key    | utf8 not null | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
>    the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
>    column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
>    cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
[RESULT][VOTE][RUST] Release Apache Arrow Rust 52.0.0 RC1
With 8 +1 votes (7 binding), including myself, the release is approved.

The release is available here:
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-52.0.0/

It has also been released to crates.io.

Thank you to everyone who helped verify this release.

Raphael

On 03/06/2024 17:03, Raphael Taylor-Davies wrote:
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 52.0.0.
>
> This release candidate is based on commit:
> f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71 [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust because...
>
> I vote +1 (binding) on this release
>
> [1]: https://github.com/apache/arrow-rs/tree/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.0.0-rc1
> [3]: https://github.com/apache/arrow-rs/blob/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71/CHANGELOG.md
> [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
Re: [DISCUSS] Statistics through the C data interface
I brought it up on GitHub, but writing here too to avoid spawning too
many threads:

https://github.com/apache/arrow/issues/38837#issuecomment-2145343755

It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add
per-batch statistics in ArrowArrayStream. While it's true that in most
cases the producer code will be applying the filtering, in the case of
the C data interface we can't take that for granted. There might be
cases where the consumer has no control over the filtering that the
producer would apply, and the producer might not be aware of the
filtering that the consumer might want to do. In those cases, providing
the statistics per batch would allow the consumer to skip the batches it
doesn't care about, thus giving the opportunity for a fast path.

On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou wrote:
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> >    transmit statistics via separated API call that uses the
> >    C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|----------|
> > | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type          | Comments |
> > |--------|---------------|----------|
> > | column | utf8          | (2)      |
> > | key    | utf8 not null | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
>    the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
>    column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
>    cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
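The per-batch fast path described in the message above could be sketched as follows. The `BatchStats` struct and the `value > threshold` predicate are hypothetical; the point is only that a batch whose max falls below the filter threshold can be skipped without being decoded:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-batch min/max statistics for one int64 column. */
typedef struct {
  int64_t min;
  int64_t max;
} BatchStats;

/* Consumer-side filter "value > threshold": if the batch max is not
 * above the threshold, no row in the batch can match. */
static int batch_may_match(const BatchStats *s, int64_t threshold) {
  return s->max > threshold;
}

/* Count how many batches the consumer would actually have to scan;
 * the rest are skipped purely from their statistics. */
static size_t count_batches_to_scan(const BatchStats *batches, size_t n,
                                    int64_t threshold) {
  size_t scanned = 0;
  for (size_t i = 0; i < n; i++) {
    if (batch_may_match(&batches[i], threshold)) scanned++;
  }
  return scanned;
}
```

With statistics only at the stream level, the consumer would have to pull and scan every batch; per-batch statistics are what make this pruning possible on the consumer side.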
Re: [DISCUSS] Statistics through the C data interface
Hi Kou,

Thanks for pushing for this!

Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> 4. Standardize Apache Arrow schema for statistics and
>    transmit statistics via separated API call that uses the
>    C data interface
[...]
> I think that 4. is the best approach in these candidates.

I agree.

> If we select 4., we need to standardize Apache Arrow schema
> for statistics. How about the following schema?
>
> Metadata:
>
> | Name                       | Value | Comments |
> |----------------------------|-------|----------|
> | ARROW::statistics::version | 1.0.0 | (1)      |

I'm not sure this is useful, but it doesn't hurt.

Nit: this should be "ARROW:statistics:version" for consistency with
https://arrow.apache.org/docs/format/Columnar.html#extension-types

> Fields:
>
> | Name   | Type          | Comments |
> |--------|---------------|----------|
> | column | utf8          | (2)      |
> | key    | utf8 not null | (3)      |

1. Should the key be something like `dictionary(int32, utf8)` to make
   the representation more efficient where there are many columns?

2. Should the statistics perhaps be nested as a map type under each
   column to avoid repeating `column`, or is that overkill?

3. Should there also be room for multi-column statistics (such as
   cardinality of a given column pair), or is it too complex for now?

Regards

Antoine.
Re: [DISCUSS] Statistics through the C data interface
Hi,

Thanks for sharing your comments. Here is a summary so far:

Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through means other than the C data interface.
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight

Candidate approaches:

1. Pass statistics (encoded as Apache Arrow data) via ArrowSchema metadata
   * This embeds the statistics' address into metadata
   * It avoids using the Apache Arrow IPC format with the C data interface
2. Embed statistics (encoded as Apache Arrow data) into ArrowSchema metadata
   * This adds statistics to metadata in the Apache Arrow IPC format
3. Embed statistics (encoded as JSON) into ArrowArray metadata
4. Standardize an Apache Arrow schema for statistics and transmit
   statistics via a separate API call that uses the C data interface
5. Use ADBC

I think that 4. is the best approach among these candidates:

1. Embedding the statistics address is tricky.
2. Consumers need to parse Apache Arrow IPC format data. (C data
   interface consumers may not have that feature.)
3. This would work, but 4. is more generic.
5. ADBC is too large to use only for statistics.

What do you think about this?

If we select 4., we need to standardize an Apache Arrow schema for
statistics. How about the following schema?

Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|----------|
| ARROW::statistics::version | 1.0.0 | (1)      |

(1) This follows semantic versioning.

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------|----------|
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (4)      |

(2) If null, then the statistic applies to the entire table. It's for
    "row_count".
(3) We'll provide pre-defined keys such as "max", "min", "byte_width"
    and "distinct_count", but users can also use application-specific
    keys.
(4) If true, then the value is approximate or best-effort.
VALUE_SCHEMA is a dense union with members:

| Name    | Type    |
|---------|---------|
| int64   | int64   |
| uint64  | uint64  |
| float64 | float64 |
| binary  | binary  |

If a column is an int32 column, it uses int64 for "max"/"min". We don't
provide all types here; users should use a compatible type (int64 for an
int32 column) instead.

Thanks,
--
kou

In <20240522.113708.2023905028549001143@clear-code.com>
  "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024
  11:37:08 +0900 (JST), Sutou Kouhei wrote:

> Hi,
>
> We're discussing how to provide statistics through the C
> data interface at:
> https://github.com/apache/arrow/issues/38837
>
> If you're interested in this feature, could you share your
> comments?
>
>
> Motivation:
>
> We can interchange Apache Arrow data by the C data interface
> in the same process. For example, we can pass Apache Arrow
> data read by Apache Arrow C++ (producer) to DuckDB
> (consumer) through the C data interface.
>
> A producer may know Apache Arrow data statistics. For
> example, a producer can know statistics when it reads Apache
> Parquet data, because Apache Parquet may provide statistics.
>
> But a consumer can't know statistics that are known by a
> producer, because there isn't a standard way to provide
> statistics through the C data interface. If a consumer can
> know statistics, it can process Apache Arrow data faster
> based on those statistics.
>
>
> Proposal:
>
> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>
> How about providing statistics as metadata in ArrowSchema?
>
> We reserve the "ARROW" namespace for internal Apache Arrow use:
>
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>
>> The ARROW pattern is a reserved namespace for internal
>> Arrow use in the custom_metadata fields. For example,
>> ARROW:extension:name.
>
> So we can use "ARROW:statistics" for the metadata key.
>
> We can represent statistics as an ArrowArray like ADBC does.
> Here is an example ArrowSchema for a record batch
> that has "int32 column1" and "string column2":
>
> ArrowSchema {
>   .format = "+siu",
>   .metadata = {
>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as
>                                           row count */
>   },
>   .children = {
>     ArrowSchema {
>       .name = "column1",
>       .format = "i",
>       .metadata = {
>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such
>                                               as count distinct */
>       },
>     },
>     ArrowSchema {
>       .name = "column2",
>       .format = "u",
>       .metadata = {
>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such
>                                               as count distinct */
>       },
>     },
>   },
> }
>
> The metadata value (A
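The type-widening rule in Kou's VALUE_SCHEMA (an int32 column's "min"/"max" are carried as the int64 union member) can be sketched in plain C. The `StatValue` tagged union and `encode_int32_min` helper are hypothetical illustrations, not part of any proposed API:

```c
#include <assert.h>
#include <stdint.h>

/* VALUE_SCHEMA offers int64/uint64/float64/binary; narrower column
 * types are widened to a compatible member when encoding a statistic.
 * (The binary member is omitted from this sketch.) */
typedef enum {
  STAT_VALUE_INT64,
  STAT_VALUE_UINT64,
  STAT_VALUE_FLOAT64
} StatValueTag;

typedef struct {
  StatValueTag tag;
  union {
    int64_t i64;
    uint64_t u64;
    double f64;
  } as;
} StatValue;

/* Encode an int32 column's "min" statistic as the int64 union member,
 * per the compatible-type rule. */
static StatValue encode_int32_min(int32_t min) {
  StatValue v;
  v.tag = STAT_VALUE_INT64;
  v.as.i64 = (int64_t)min; /* int32 -> int64 widening is lossless */
  return v;
}
```

The same pattern would apply to other narrow types: uint8/uint16/uint32 widen to the uint64 member, float32 to float64, keeping the union small while remaining lossless.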