Hi,

Could you explain more about your idea? Does it propose that we add
more callbacks to ArrowArrayStream, such as
ArrowArrayStream::get_statistics()? Or does it propose that we define
one more Arrow C XXX interface that wraps ArrowArrayStream, like
ArrowDeviceArray wraps ArrowArray?

ArrowDeviceArray: https://arrow.apache.org/docs/format/CDeviceDataInterface.html
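
For concreteness, the first option would mean appending a new member
to the stream interface. A minimal sketch follows; the get_statistics
callback is hypothetical, and appending a member to a frozen ABI is
itself a compatibility question:

  struct ArrowArrayStream {
    /* Existing members of the Arrow C stream interface. */
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;

    /* Hypothetical addition: fill `out` with a statistics array that
       uses an agreed-upon statistics schema. Returns 0 on success or
       an errno-compatible code on error, like the other callbacks. */
    int (*get_statistics)(struct ArrowArrayStream*, struct ArrowArray* out);
  };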
Thanks,
--
kou

In <cao-cae6rnfohw+gr4hw9hu5lhwcq6+wbsvc0j_xuayxqwgk...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 06:55:40 -0700,
  Curt Hagenlocher <c...@hagenlocher.org> wrote:

>> would it be easier to request statistics at a higher level of
>> abstraction?
>
> What if there were a "single table provider" level of abstraction
> between ADBC and ArrowArrayStream as a C API; something that can
> report statistics and apply simple predicates?
>
> On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
>
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow
>> why we are constrained to the ArrowSchema to represent them. The
>> examples given seem to be going through Python... would it be easier
>> to request statistics at a higher level of abstraction? There would
>> already need to be a separate mechanism to request an
>> ArrowArrayStream with statistics (unless the PyCapsule
>> `requested_schema` argument would suffice).
>>
>> > ADBC may be a bit larger to use only for transmitting
>> > statistics. ADBC has statistics related APIs but it has more
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver.
>> One can implement an ADBC driver without defining all the methods
>> (where the producer could call AdbcConnectionGetStatistics(),
>> although AdbcStatementGetStatistics() might be more relevant here
>> and doesn't exist). One example listed (using an Arrow Table as a
>> source) seems a bit light to wrap in an ADBC driver; however, it
>> would not take much code to do so, and the overhead of getting the
>> reader via ADBC is something like 100 microseconds (tested via the
>> ADBC R package's "monkey driver", which wraps an existing stream as
>> a statement). In any case, the bulk of the code is building the
>> statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format for statistics is decided on, I imagine it should be
>> exactly the same as the ADBC standard? (Perhaps pushing changes
>> upstream if needed?)
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use this approach because all of the
>> > feedback said so. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Field Name               | Field Type            | Comments |
>> > |--------------------------|-----------------------|----------|
>> > | column_name              | utf8                  | (1)      |
>> > | statistic_key            | utf8 not null         | (2)      |
>> > | statistic_value          | VALUE_SCHEMA not null |          |
>> > | statistic_is_approximate | bool not null         | (3)      |
>> >
>> > 1. If null, then the statistic applies to the entire table.
>> >    It's for "row_count".
>> > 2. We'll provide pre-defined keys such as "max", "min",
>> >    "byte_width" and "distinct_count", but users can also use
>> >    application-specific keys.
>> > 3. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type |
>> > |------------|------------|
>> > | int64      | int64      |
>> > | uint64     | uint64     |
>> > | float64    | float64    |
>> > | binary     | binary     |
>> >
>> > If a column is an int32 column, it uses int64 for
>> > "max"/"min". We don't provide all types here; users should
>> > use a compatible type (int64 for an int32 column) instead.
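>> >
>> > In the illustrative notation from the original proposal below,
>> > this statistics array's schema would look something like the
>> > following sketch (format strings per the C data interface:
>> > "u" = utf8, "b" = bool, "+ud:..." = dense union; the "not null"
>> > constraints are expressed via flags and omitted here):
>> >
>> > ArrowSchema {
>> >   .format = "+s",
>> >   .children = {
>> >     ArrowSchema { .name = "column_name", .format = "u" },
>> >     ArrowSchema { .name = "statistic_key", .format = "u" },
>> >     ArrowSchema {
>> >       .name = "statistic_value",
>> >       .format = "+ud:0,1,2,3", /* dense union, type ids 0-3 */
>> >       .children = {
>> >         ArrowSchema { .name = "int64",   .format = "l" },
>> >         ArrowSchema { .name = "uint64",  .format = "L" },
>> >         ArrowSchema { .name = "float64", .format = "g" },
>> >         ArrowSchema { .name = "binary",  .format = "z" },
>> >       },
>> >     },
>> >     ArrowSchema { .name = "statistic_is_approximate", .format = "b" },
>> >   },
>> > }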
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>> >   Antoine Pitrou <anto...@python.org> wrote:
>> >
>> > > Hi Kou,
>> > >
>> > > I agree with Dewey that this is overstretching the capabilities
>> > > of the C Data Interface. In particular, stuffing a pointer into a
>> > > metadata value and decreeing it immortal doesn't sound like a
>> > > good design decision.
>> > >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice (Dewey mentioned ADBC, but it is of course
>> > > just one possible API among others)?
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> > > On 22/05/2024 04:37, Sutou Kouhei wrote:
>> > >> Hi,
>> > >>
>> > >> We're discussing how to provide statistics through the C
>> > >> data interface at:
>> > >> https://github.com/apache/arrow/issues/38837
>> > >>
>> > >> If you're interested in this feature, could you share your
>> > >> comments?
>> > >>
>> > >> Motivation:
>> > >>
>> > >> We can interchange Apache Arrow data through the C data
>> > >> interface within the same process. For example, we can pass
>> > >> Apache Arrow data read by Apache Arrow C++ (producer) to
>> > >> DuckDB (consumer) through the C data interface.
>> > >>
>> > >> A producer may know statistics about its Apache Arrow data.
>> > >> For example, a producer can know statistics when it reads
>> > >> Apache Parquet data, because Apache Parquet may provide
>> > >> statistics. But a consumer can't know statistics that are
>> > >> known by the producer, because there isn't a standard way to
>> > >> provide statistics through the C data interface. If a consumer
>> > >> could know the statistics, it could process the Apache Arrow
>> > >> data faster based on them.
>> > >>
>> > >> Proposal:
>> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> > >>
>> > >> How about providing statistics as metadata in ArrowSchema?
>> > >>
>> > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> > >>
>> > >>> The ARROW pattern is a reserved namespace for internal
>> > >>> Arrow use in the custom_metadata fields. For example,
>> > >>> ARROW:extension:name.
>> > >>
>> > >> So we can use "ARROW:statistics" for the metadata key.
>> > >>
>> > >> We can represent statistics as an ArrowArray, like ADBC does.
>> > >>
>> > >> Here is an example ArrowSchema for a record batch that has
>> > >> "int32 column1" and "string column2":
>> > >>
>> > >> ArrowSchema {
>> > >>   .format = "+s",
>> > >>   .metadata = {
>> > >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics
>> > >>                                           such as row count */
>> > >>   },
>> > >>   .children = {
>> > >>     ArrowSchema {
>> > >>       .name = "column1",
>> > >>       .format = "i",
>> > >>       .metadata = {
>> > >>         "ARROW:statistics" => ArrowArray*, /* column-level
>> > >>           statistics such as count distinct */
>> > >>       },
>> > >>     },
>> > >>     ArrowSchema {
>> > >>       .name = "column2",
>> > >>       .format = "u",
>> > >>       .metadata = {
>> > >>         "ARROW:statistics" => ArrowArray*, /* column-level
>> > >>           statistics such as count distinct */
>> > >>       },
>> > >>     },
>> > >>   },
>> > >> }
>> > >>
>> > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > >> => ArrowArray*' is a base-10 string of the address of the
>> > >> ArrowArray, because metadata values can only be strings. You
>> > >> can't release the statistics ArrowArray*. (Its release is a
>> > >> no-op function.) It follows
>> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > >> semantics. (The base ArrowSchema owns the statistics
>> > >> ArrowArray*.)
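>> > >>
>> > >> For example, a consumer could decode the metadata value back
>> > >> into a pointer along these lines (a hypothetical helper, not
>> > >> part of the proposal; it assumes the ArrowArray declaration
>> > >> from the C data interface header is in scope):
>> > >>
>> > >> #include <stdint.h>
>> > >> #include <stdlib.h>
>> > >> #include <string.h>
>> > >>
>> > >> /* Parse an "ARROW:statistics" metadata value: a base-10 string
>> > >>    holding the address of a schema-owned ArrowArray. The result
>> > >>    must not be released by the consumer (its release callback
>> > >>    is a no-op anyway). */
>> > >> static struct ArrowArray*
>> > >> decode_statistics_value(const char* value, int32_t length) {
>> > >>   /* Metadata values are length-prefixed, not NUL-terminated,
>> > >>      so copy into a terminated buffer before parsing. */
>> > >>   char buffer[32];
>> > >>   if (length <= 0 || length >= (int32_t)sizeof(buffer)) return NULL;
>> > >>   memcpy(buffer, value, (size_t)length);
>> > >>   buffer[length] = '\0';
>> > >>   return (struct ArrowArray*)(uintptr_t)strtoull(buffer, NULL, 10);
>> > >> }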
>> > >>
>> > >> The ArrowArray* for statistics uses the following schema:
>> > >>
>> > >> | Field Name     | Field Type              | Comments |
>> > >> |----------------|-------------------------|----------|
>> > >> | key            | string not null         | (1)      |
>> > >> | value          | `VALUE_SCHEMA` not null |          |
>> > >> | is_approximate | bool not null           | (2)      |
>> > >>
>> > >> 1. We'll provide pre-defined keys such as "max", "min",
>> > >>    "byte_width" and "distinct_count", but users can also use
>> > >>    application-specific keys.
>> > >> 2. If true, then the value is approximate or best-effort.
>> > >>
>> > >> VALUE_SCHEMA is a dense union with members:
>> > >>
>> > >> | Field Name | Field Type                       | Comments |
>> > >> |------------|----------------------------------|----------|
>> > >> | int64      | int64                            |          |
>> > >> | uint64     | uint64                           |          |
>> > >> | float64    | float64                          |          |
>> > >> | value      | The same type as the ArrowSchema | (3)      |
>> > >> |            | that it belongs to               |          |
>> > >>
>> > >> 3. If the ArrowSchema's type is string, this type is also
>> > >>    string.
>> > >>
>> > >> TODO: Is "value" a good name? If we refer to it from the
>> > >> top-level statistics schema, we need to use
>> > >> "value.value". That's a bit strange...
>> > >>
>> > >> What do you think about this proposal? Could you share your
>> > >> comments?
>> > >>
>> > >> Thanks,