Hi,

Could you explain your idea in more detail? Does it propose that
we add more callbacks to ArrowArrayStream such as
ArrowArrayStream::get_statistics()? Or does it propose that
we define one more Arrow C XXX interface that wraps
ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?

ArrowDeviceArray:
https://arrow.apache.org/docs/format/CDeviceDataInterface.html
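
Just to make the question concrete, here is a rough sketch of the two
readings (all of the names here are hypothetical; none of this exists
yet):

/* (a) One more callback on ArrowArrayStream itself, e.g.: */
/*
int (*get_statistics)(struct ArrowArrayStream* self,
                      struct ArrowArray* out);
*/

/* (b) One more interface that wraps ArrowArrayStream, like
   ArrowDeviceArray wraps ArrowArray: */
struct ArrowStatisticsArrayStream {
  struct ArrowArrayStream stream;   /* the wrapped stream */
  int (*get_statistics)(struct ArrowStatisticsArrayStream* self,
                        struct ArrowArray* out);
  void (*release)(struct ArrowStatisticsArrayStream* self);
  void* private_data;
};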


Thanks,
-- 
kou

In <cao-cae6rnfohw+gr4hw9hu5lhwcq6+wbsvc0j_xuayxqwgk...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
06:55:40 -0700,
  Curt Hagenlocher <c...@hagenlocher.org> wrote:

>> would it be easier to request statistics at a higher level of
>> abstraction?
> 
> What if there were a "single table provider" level of abstraction between
> ADBC and ArrowArrayStream as a C API; something that can report statistics
> and apply simple predicates?
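> 
> As a very rough sketch (every name here is hypothetical and only
> meant to illustrate the shape of the idea):
> 
> struct ArrowTableProvider {
>   /* Report table- and column-level statistics as an ArrowArray
>      using an agreed-upon statistics schema. */
>   int (*get_statistics)(struct ArrowTableProvider* self,
>                         struct ArrowArray* out);
>   /* Optionally push a simple predicate down to the producer. */
>   int (*apply_predicate)(struct ArrowTableProvider* self,
>                          const char* predicate);
>   /* Produce the (possibly filtered) data as a stream. */
>   int (*get_stream)(struct ArrowTableProvider* self,
>                     struct ArrowArrayStream* out);
>   void (*release)(struct ArrowTableProvider* self);
>   void* private_data;
> };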
> 
> On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
> 
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow why
>> we are constrained to the ArrowSchema to represent them. The examples
>> given seem to be going through Python... would it be easier to request
>> statistics at a higher level of abstraction? There would already need
>> to be a separate mechanism to request an ArrowArrayStream with
>> statistics (unless the PyCapsule `requested_schema` argument would
>> suffice).
>>
>> > ADBC may be a bit large to use only for transmitting
>> > statistics. ADBC has statistics-related APIs, but it also has many
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> can implement an ADBC driver without defining all the methods (where
>> the producer could call AdbcConnectionGetStatistics(), although
>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>> exist). One example listed (using an Arrow Table as a source) seems a
>> bit light to wrap in an ADBC driver; however, it would not take much
>> code to do so, and the overhead of getting the reader via ADBC is
>> something like 100 microseconds (tested via the ADBC R package's
>> "monkey driver" which wraps an existing stream as a statement). In any
>> case, the bulk of the code is building the statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format is decided on for the statistics, I imagine it should be
>> exactly the same as the ADBC standard (perhaps pushing changes
>> upstream if needed)?
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use this approach, because all of
>> > the feedback suggested it. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Field Name               | Field Type            | Comments |
>> > |--------------------------|-----------------------|----------|
>> > | column_name              | utf8                  | (1)      |
>> > | statistic_key            | utf8 not null         | (2)      |
>> > | statistic_value          | VALUE_SCHEMA not null |          |
>> > | statistic_is_approximate | bool not null         | (3)      |
>> >
>> > 1. If null, then the statistic applies to the entire table
>> >    (for example, "row_count").
>> > 2. We'll provide pre-defined keys such as "max", "min",
>> >    "byte_width" and "distinct_count" but users can also use
>> >    application specific keys.
>> > 3. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type |
>> > |------------|------------|
>> > | int64      | int64      |
>> > | uint64     | uint64     |
>> > | float64    | float64    |
>> > | binary     | binary     |
>> >
>> > If a column is an int32 column, it uses int64 for
>> > "max"/"min". We don't provide all types here. Users should
>> > use a compatible type (int64 for an int32 column) instead.
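>> >
>> > For reference, here is a rough sketch of how this schema could be
>> > spelled with Arrow C data interface format strings (nullability
>> > would be expressed via ARROW_FLAG_NULLABLE; this is only an
>> > illustration, not a final definition):
>> >
>> > ArrowSchema {
>> >   .format = "+s",                /* struct: one row per statistic */
>> >   .children = {
>> >     ArrowSchema { .name = "column_name",   .format = "u" },  /* utf8 */
>> >     ArrowSchema { .name = "statistic_key", .format = "u" },  /* utf8 */
>> >     ArrowSchema {
>> >       .name = "statistic_value",
>> >       .format = "+ud:0,1,2,3",   /* dense union */
>> >       .children = {
>> >         ArrowSchema { .name = "int64",   .format = "l" },
>> >         ArrowSchema { .name = "uint64",  .format = "L" },
>> >         ArrowSchema { .name = "float64", .format = "g" },
>> >         ArrowSchema { .name = "binary",  .format = "z" },
>> >       },
>> >     },
>> >     ArrowSchema { .name = "statistic_is_approximate", .format = "b" },
>> >   },
>> > }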
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
>> 2024 17:04:57 +0200,
>> >   Antoine Pitrou <anto...@python.org> wrote:
>> >
>> > >
>> > > Hi Kou,
>> > >
>> > > I agree with Dewey that this is overstretching the capabilities of the
>> > > C Data Interface. In particular, stuffing a pointer into a metadata value
>> > > and decreeing it immortal doesn't sound like a good design decision.
>> > >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice (Dewey mentioned ADBC but it is of course just
>> > > a possible API among others)?
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> > >
>> > > On 22/05/2024 04:37, Sutou Kouhei wrote:
>> > >> Hi,
>> > >>
>> > >> We're discussing how to provide statistics through the C
>> > >> data interface at:
>> > >> https://github.com/apache/arrow/issues/38837
>> > >>
>> > >> If you're interested in this feature, could you share your
>> > >> comments?
>> > >>
>> > >> Motivation:
>> > >>
>> > >> We can interchange Apache Arrow data through the C data interface
>> > >> in the same process. For example, we can pass Apache Arrow
>> > >> data read by Apache Arrow C++ (producer) to DuckDB
>> > >> (consumer) through the C data interface.
>> > >>
>> > >> A producer may know statistics about the Apache Arrow data. For
>> > >> example, a producer can know statistics when it reads Apache
>> > >> Parquet data because Apache Parquet may provide statistics.
>> > >> But a consumer can't know statistics that are known by the
>> > >> producer, because there isn't a standard way to provide
>> > >> statistics through the C data interface. If a consumer can
>> > >> know statistics, it can process Apache Arrow data faster
>> > >> based on those statistics.
>> > >>
>> > >> Proposal:
>> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> > >>
>> > >> How about providing statistics as metadata in ArrowSchema?
>> > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> > >>
>> > >>> The ARROW pattern is a reserved namespace for internal
>> > >>> Arrow use in the custom_metadata fields. For example,
>> > >>> ARROW:extension:name.
>> > >>
>> > >> So we can use "ARROW:statistics" for the metadata key.
>> > >> We can represent statistics as an ArrowArray like ADBC does.
>> > >>
>> > >> Here is an example ArrowSchema that is for a record batch
>> > >> that has "int32 column1" and "string column2":
>> > >> ArrowSchema {
>> > >>    .format = "+s",
>> > >>    .metadata = {
>> > >>      "ARROW:statistics" => ArrowArray*, /* table-level statistics
>> > >>                                            such as row count */
>> > >>    },
>> > >>    .children = {
>> > >>      ArrowSchema {
>> > >>        .name = "column1",
>> > >>        .format = "i",
>> > >>        .metadata = {
>> > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics
>> > >>                                                such as count distinct */
>> > >>        },
>> > >>      },
>> > >>      ArrowSchema {
>> > >>        .name = "column2",
>> > >>        .format = "u",
>> > >>        .metadata = {
>> > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics
>> > >>                                                such as count distinct */
>> > >>        },
>> > >>      },
>> > >>    },
>> > >> }
>> > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > >> => ArrowArray*' is a base 10 string of the address of the
>> > >> ArrowArray, because we can only use strings as metadata
>> > >> values. You can't release the statistics ArrowArray*. (Its
>> > >> release is a no-op function.) It follows the
>> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > >> semantics. (The base ArrowSchema owns the statistics
>> > >> ArrowArray*.)
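>> > >>
>> > >> For example, producing and consuming that metadata value could
>> > >> look roughly like this (a sketch; "statistics_array" and "value"
>> > >> are placeholders):
>> > >>
>> > >>   /* Producer: encode the address as a base 10 string
>> > >>      (assuming <stdio.h> and <inttypes.h>). */
>> > >>   char buf[32];
>> > >>   snprintf(buf, sizeof(buf), "%" PRIuPTR, (uintptr_t)statistics_array);
>> > >>   /* ... store buf as the "ARROW:statistics" metadata value ... */
>> > >>
>> > >>   /* Consumer: decode the string back into a pointer. It must not
>> > >>      release this ArrowArray; its release callback is a no-op. */
>> > >>   struct ArrowArray* statistics =
>> > >>       (struct ArrowArray*)(uintptr_t)strtoull(value, NULL, 10);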
>> > >> The ArrowArray* for statistics uses the following schema:
>> > >>
>> > >> | Field Name     | Field Type                       | Comments |
>> > >> |----------------|----------------------------------|----------|
>> > >> | key            | string not null                  | (1)      |
>> > >> | value          | `VALUE_SCHEMA` not null          |          |
>> > >> | is_approximate | bool not null                    | (2)      |
>> > >>
>> > >> 1. We'll provide pre-defined keys such as "max", "min",
>> > >>    "byte_width" and "distinct_count" but users can also use
>> > >>    application specific keys.
>> > >> 2. If true, then the value is approximate or best-effort.
>> > >>
>> > >> VALUE_SCHEMA is a dense union with members:
>> > >>
>> > >> | Field Name | Field Type                       | Comments |
>> > >> |------------|----------------------------------|----------|
>> > >> | int64      | int64                            |          |
>> > >> | uint64     | uint64                           |          |
>> > >> | float64    | float64                          |          |
>> > >> | value      | The same type as the ArrowSchema | (3)      |
>> > >> |            | it belongs to.                   |          |
>> > >>
>> > >> 3. If the ArrowSchema's type is string, this type is also string.
>> > >>    TODO: Is "value" a good name? If we refer to it from the
>> > >>    top-level statistics schema, we need to use
>> > >>    "value.value". It's a bit strange...
>> > >> What do you think about this proposal? Could you share your
>> > >> comments?
>> > >> Thanks,
>>
