Hi,

I agree that the proposed approach is a departure from the
existing use of ArrowSchema.

ADBC may be too heavyweight to adopt only for transmitting
statistics. ADBC has statistics-related APIs, but it also has
many other APIs.


>          It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?

This may be a candidate approach. Or we may want to add an
ArrowArrayStream::get_statistics() callback. But that breaks
the ABI...
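
For reference, here is a minimal sketch of what such a callback
could look like (purely hypothetical; appending a member changes
the struct's size, which is exactly the ABI break):

  /* Hypothetical, ABI-breaking extension of the Arrow C stream
     interface; assumes the standard ArrowSchema/ArrowArray
     definitions from the Arrow C data interface. */
  struct ArrowArrayStream {
    /* Existing callbacks, unchanged. */
    int (*get_schema)(struct ArrowArrayStream* self,
                      struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream* self,
                    struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream* self);
    void (*release)(struct ArrowArrayStream* self);
    void* private_data;

    /* New callback: exports stream-level statistics as an
       ArrowArray. Existing consumers and producers would
       disagree on sizeof(struct ArrowArrayStream). */
    int (*get_statistics)(struct ArrowArrayStream* self,
                          struct ArrowArray* out);
  };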


How about only defining a schema for the statistics ArrowArray?
Unlike the Arrow C data interface, this wouldn't define how to
obtain the statistics ArrowArray, but standardizing the schema
of the statistics ArrowArray would still improve how statistics
are transmitted.
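
In that case, how a consumer obtains the statistics ArrowArray
would stay producer-specific. For example (just a sketch; the
function name and signature are hypothetical):

  /* Hypothetical producer-specific entry point. Only the schema
     of the exported statistics ArrowArray would be standardized;
     how a consumer gets it is up to each producer. */
  int my_producer_get_statistics(struct ArrowArray* out_statistics,
                                 struct ArrowSchema* out_schema);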


Thanks,
-- 
kou

In <CAFb7qScyETeHxAo+uHcrqCYe=7qcmkseq+zurzm5ckcjjvu...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
11:38:13 -0300,
  Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> I am definitely in favor of adding (or adopting an existing)
> ABI-stable way to transmit statistics (the one that comes up most
> frequently for me is just the number of values that are about to show
> up in an ArrowArrayStream, since the producer often knows this and the
> consumer often would like to preallocate).
> 
> I am skeptical of using the existing C ArrowSchema ABI to do this. The
> ArrowSchema is exceptionally good at representing Arrow data types
> (which in the presence of dictionaries, nested types, and extensions,
> is difficult to do); however, using it to handle all aspects of a
> consumer request/producer response I think dilutes its ability to do
> this well.
> 
> If I'm understanding the proposal (and I may not be), the ArrowSchema
> will be used to encode data-dependent values, which means the same
> ArrowSchema is very tightly paired to a particular array stream (or
> array). This means that one could no longer (e.g.) consume an array
> stream and blindly assign each array in the stream the schema that was
> returned by get_schema(). This is not impossible to work around but it
> is a conceptual departure from the role the ArrowSchema has had in the
> past. Encoding pointers as strings in metadata is also a departure
> from what we have done previously.
> 
> It is possible to condense the boilerplate of an ADBC driver to about
> 10 lines of code [1]. Is there a reason we can't use ADBC (or an
> extension to that standard) to more precisely handle those types of
> requests/responses (and extensions to them that come up in the
> future)? It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?
> 
> [1] 
> https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66
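> 
> As a concrete sketch of the "descriptor struct" idea (everything
> here is hypothetical, not an existing or proposed ABI):
> 
>   /* Hypothetical descriptor that pairs a stream with out-of-band,
>      data-dependent information, so that ArrowSchema stays a pure
>      type description. Assumes the standard ArrowArrayStream and
>      ArrowArray definitions from the Arrow C data interface. */
>   struct ArrowStreamDescriptor {
>     struct ArrowArrayStream stream;
>     /* Writes statistics (e.g. expected row count) into out,
>        using whatever statistics schema is agreed upon;
>        returns an errno-style code. */
>     int (*get_statistics)(struct ArrowStreamDescriptor* self,
>                           struct ArrowArray* out);
>     void* private_data;
>   };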
> 
> On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>>
>> Hi,
>>
>> One potential challenge with encoding statistics in the schema metadata
>> is that some systems may consider this metadata as part of assessing
>> schema equivalence.
>>
>> However, I think the bigger question is what the intended use-case for
>> these statistics is. Often query engines want to collect statistics from
>> multiple containers in one go, as this allows for efficient vectorised
>> pruning across multiple files, row groups, etc... I therefore wonder if
>> the solution is simply to return separate arrays of min, max, etc...
>> potentially even grouped together into a single StructArray?
>>
>> This would have the benefit of not needing specification changes, whilst
>> being significantly more efficient than an approach centered on scalar
>> statistics. FWIW this is the approach taken by DataFusion for pruning
>> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
>> needing to define a parallel serialization standard [2].
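>>
>> As a sketch of what that could look like through the C data
>> interface (field names and types here are illustrative, not a
>> proposal):
>>
>>   /* One entry per container (file, row group, ...), so a consumer
>>      can prune in a vectorised way. Release callbacks and flags are
>>      omitted for brevity; assumes the standard ArrowSchema struct
>>      from the Arrow C data interface. */
>>   static struct ArrowSchema kMin = {.format = "i", .name = "min"};
>>   static struct ArrowSchema kMax = {.format = "i", .name = "max"};
>>   static struct ArrowSchema kNullCount =
>>       {.format = "l", .name = "null_count"};
>>   static struct ArrowSchema* kChildren[] = {&kMin, &kMax, &kNullCount};
>>   static struct ArrowSchema kPruningStats = {
>>       .format = "+s", /* struct<min: int32, max: int32, null_count: int64> */
>>       .children = kChildren,
>>       .n_children = 3,
>>   };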
>>
>> Kind Regards,
>>
>> Raphael
>>
>> [1]:
>> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
>> [2]: https://github.com/apache/arrow-rs/pull/4393
>>
>> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> > Hi,
>> >
>> > We're discussing how to provide statistics through the C
>> > data interface at:
>> > https://github.com/apache/arrow/issues/38837
>> >
>> > If you're interested in this feature, could you share your
>> > comments?
>> >
>> >
>> > Motivation:
>> >
>> > We can interchange Apache Arrow data via the C data interface
>> > within the same process. For example, we can pass Apache Arrow
>> > data read by Apache Arrow C++ (producer) to DuckDB
>> > (consumer) through the C data interface.
>> >
>> > A producer may know statistics about its Apache Arrow data.
>> > For example, a producer that reads Apache Parquet data can
>> > know statistics because Apache Parquet may provide them.
>> >
>> > But a consumer can't access statistics known by the producer,
>> > because there isn't a standard way to provide statistics
>> > through the C data interface. If a consumer could access
>> > statistics, it could process the Apache Arrow data faster.
>> >
>> >
>> > Proposal:
>> >
>> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> >
>> > How about providing statistics as metadata in ArrowSchema?
>> >
>> > We reserve "ARROW" namespace for internal Apache Arrow use:
>> >
>> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >
>> >> The ARROW pattern is a reserved namespace for internal
>> >> Arrow use in the custom_metadata fields. For example,
>> >> ARROW:extension:name.
>> > So we can use "ARROW:statistics" for the metadata key.
>> >
>> > We can represent statistics as an ArrowArray like ADBC does.
>> >
>> > Here is an example ArrowSchema for a record batch that has an
>> > "int32 column1" and a "string column2":
>> >
>> > ArrowSchema {
>> >   .format = "+siu",
>> >   .metadata = {
>> >     /* table-level statistics such as row count */
>> >     "ARROW:statistics" => ArrowArray*,
>> >   },
>> >   .children = {
>> >     ArrowSchema {
>> >       .name = "column1",
>> >       .format = "i",
>> >       .metadata = {
>> >         /* column-level statistics such as count distinct */
>> >         "ARROW:statistics" => ArrowArray*,
>> >       },
>> >     },
>> >     ArrowSchema {
>> >       .name = "column2",
>> >       .format = "u",
>> >       .metadata = {
>> >         /* column-level statistics such as count distinct */
>> >         "ARROW:statistics" => ArrowArray*,
>> >       },
>> >     },
>> >   },
>> > }
>> >
>> > The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > => ArrowArray*' is the address of the statistics ArrowArray
>> > rendered as a base 10 string, because metadata values can only
>> > be strings. You must not release the statistics ArrowArray*.
>> > (Its release callback is a no-op function.) It follows the
>> > https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > semantics: the base ArrowSchema owns the statistics
>> > ArrowArray*.
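>> >
>> > For example, both sides could handle the encoding like this (a
>> > sketch; statistics_array and metadata_value are placeholder
>> > variables and error handling is omitted):
>> >
>> >   #include <inttypes.h>
>> >   #include <stdint.h>
>> >   #include <stdio.h>
>> >   #include <stdlib.h>
>> >
>> >   /* Producer: render the statistics ArrowArray's address as a
>> >      base 10 string to use as the metadata value. */
>> >   char buffer[21]; /* enough for a 64-bit address in base 10 */
>> >   snprintf(buffer, sizeof(buffer), "%" PRIuPTR,
>> >            (uintptr_t)statistics_array);
>> >
>> >   /* Consumer: decode the "ARROW:statistics" metadata value back
>> >      into a pointer. The base ArrowSchema keeps ownership, so the
>> >      consumer must not call the release callback. */
>> >   struct ArrowArray* statistics =
>> >       (struct ArrowArray*)(uintptr_t)strtoull(metadata_value, NULL, 10);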
>> >
>> >
>> > The statistics ArrowArray* uses the following schema:
>> >
>> > | Field Name     | Field Type              | Comments |
>> > |----------------|-------------------------|----------|
>> > | key            | string not null         | (1)      |
>> > | value          | `VALUE_SCHEMA` not null |          |
>> > | is_approximate | bool not null           | (2)      |
>> >
>> > 1. We'll provide pre-defined keys such as "max", "min",
>> >     "byte_width" and "distinct_count", but users can also use
>> >     application-specific keys.
>> >
>> > 2. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type                                     | Comments |
>> > |------------|------------------------------------------------|----------|
>> > | int64      | int64                                          |          |
>> > | uint64     | uint64                                         |          |
>> > | float64    | float64                                        |          |
>> > | value      | The same type as the ArrowSchema it belongs to | (3)      |
>> >
>> > 3. If the ArrowSchema's type is string, this type is also string.
>> >
>> >     TODO: Is "value" a good name? If we refer to it from the
>> >     top-level statistics schema, we need to use
>> >     "value.value". It's a bit strange...
>> >
>> >
>> > What do you think about this proposal? Could you share your
>> > comments?
>> >
>> >
>> > Thanks,
