I am definitely in favor of adding (or adopting an existing)
ABI-stable way to transmit statistics (the one that comes up most
frequently for me is just the number of values that are about to show
up in an ArrowArrayStream, since the producer often knows this and the
consumer often would like to preallocate).
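
For example, with such a hint a consumer could allocate its output once
up front instead of growing it while consuming the stream. A rough
sketch (the hint itself is hypothetical; no ABI-stable mechanism carries
it today):

#include <stdint.h>
#include <stdlib.h>

/* Sketch only: `hinted_length` stands in for whatever ABI-stable
 * mechanism ends up carrying the producer's row-count hint. */
int32_t* prepare_output(int64_t hinted_length) {
  /* One allocation up front instead of repeated reallocation while
   * consuming chunks from the ArrowArrayStream. */
  return (int32_t*)malloc((size_t)hinted_length * sizeof(int32_t));
}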

I am skeptical of using the existing C ArrowSchema ABI to do this. The
ArrowSchema is exceptionally good at representing Arrow data types
(which, in the presence of dictionaries, nested types, and extensions,
is difficult to do); however, I think using it to handle all aspects of
a consumer request/producer response dilutes its ability to do that
well.

If I'm understanding the proposal (and I may not be), the ArrowSchema
will be used to encode data-dependent values, which means the same
ArrowSchema is very tightly paired to a particular array stream (or
array). This means that one could no longer (e.g.) consume an array
stream and blindly assign each array in the stream the schema that was
returned by get_schema(). This is not impossible to work around but it
is a conceptual departure from the role the ArrowSchema has had in the
past. Encoding pointers as strings in metadata is also a departure
from what we have done previously.
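
Concretely, I mean the usual consumption pattern where the schema is
fetched once and then assumed to describe every chunk. A simplified
sketch with error handling omitted (assuming the C stream interface
definitions, e.g. Arrow C++'s arrow/c/abi.h, are in scope;
consume_chunk() is a placeholder):

#include "arrow/c/abi.h"

/* Placeholder for whatever the consumer does with each chunk. */
void consume_chunk(const struct ArrowSchema* schema, struct ArrowArray* chunk);

void consume_stream(struct ArrowArrayStream* stream) {
  struct ArrowSchema schema;
  stream->get_schema(stream, &schema);  /* fetched exactly once */

  struct ArrowArray chunk;
  while (stream->get_next(stream, &chunk) == 0 && chunk.release != NULL) {
    /* `schema` is assumed to describe every chunk; data-dependent
     * statistics in its metadata would invalidate this reuse. */
    consume_chunk(&schema, &chunk);
    chunk.release(&chunk);
  }

  schema.release(&schema);
}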

It is possible to condense the boilerplate of an ADBC driver to about
10 lines of code [1]. Is there a reason we can't use ADBC (or an
extension to that standard) to handle these kinds of requests/responses
more precisely (and any extensions to them that come up in the future)?
This is also not the first time the need to encode data-dependent
information in a schema has come up (e.g., encoding scalar/record
batch-ness), so perhaps there is a need for another type of array
stream or descriptor struct?

[1] 
https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66
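
To make that last question concrete, I am imagining something along
these lines (purely a strawman; none of these names exist anywhere
today):

#include <stdint.h>

struct ArrowArray;  /* from the Arrow C data interface */

/* Strawman: a separate, data-dependent descriptor, so that ArrowSchema
 * itself stays purely a description of the type. */
struct ArrowArrayDescriptor {
  int64_t length_hint;            /* number of values, or -1 if unknown */
  struct ArrowArray* statistics;  /* optional statistics, NULL if absent */

  /* Release callback and producer-private state, following the same
   * conventions as the other C data interface structs. */
  void (*release)(struct ArrowArrayDescriptor* self);
  void* private_data;
};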

On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
>
> Hi,
>
> One potential challenge with encoding statistics in the schema metadata
> is that some systems may consider this metadata as part of assessing
> schema equivalence.
>
> However, I think the bigger question is what the intended use-case for
> these statistics is. Often query engines want to collect statistics from
> multiple containers in one go, as this allows for efficient vectorised
> pruning across multiple files, row groups, etc... I therefore wonder if
> the solution is simply to return separate arrays of min, max, etc...
> potentially even grouped together into a single StructArray?
>
> This would have the benefit of not needing specification changes, whilst
> being significantly more efficient than an approach centered on scalar
> statistics. FWIW this is the approach taken by DataFusion for pruning
> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
> needing to define a parallel serialization standard [2].
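>
> To make the grouped form concrete, for a single int32 column the
> statistics could be a StructArray with one row per container (file,
> row group, etc.), along the lines of the following sketch (field names
> are arbitrary):
>
> ArrowSchema {
>   .format = "+s",                          /* StructArray */
>   .children = {
>     ArrowSchema { .name = "min",        .format = "i" },  /* int32 */
>     ArrowSchema { .name = "max",        .format = "i" },  /* int32 */
>     ArrowSchema { .name = "null_count", .format = "l" },  /* int64 */
>   },
> }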
>
> Kind Regards,
>
> Raphael
>
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
>
> On 22/05/2024 03:37, Sutou Kouhei wrote:
> > Hi,
> >
> > We're discussing how to provide statistics through the C
> > data interface at:
> > https://github.com/apache/arrow/issues/38837
> >
> > If you're interested in this feature, could you share your
> > comments?
> >
> >
> > Motivation:
> >
> > We can interchange Apache Arrow data through the C data interface
> > within the same process. For example, we can pass Apache Arrow
> > data read by Apache Arrow C++ (producer) to DuckDB
> > (consumer) through the C data interface.
> >
> > A producer may know statistics about the Apache Arrow data. For
> > example, a producer can know statistics when it reads Apache
> > Parquet data, because Apache Parquet may provide statistics.
> >
> > But a consumer can't access statistics that are known by the
> > producer, because there isn't a standard way to provide statistics
> > through the C data interface. If a consumer could access these
> > statistics, it could process the Apache Arrow data faster.
> >
> >
> > Proposal:
> >
> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >
> > How about providing statistics as metadata in the ArrowSchema?
> >
> > We reserve "ARROW" namespace for internal Apache Arrow use:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> >> The ARROW pattern is a reserved namespace for internal
> >> Arrow use in the custom_metadata fields. For example,
> >> ARROW:extension:name.
> > So we can use "ARROW:statistics" for the metadata key.
> >
> > We can represent statistics as an ArrowArray, like ADBC does.
> >
> > Here is an example ArrowSchema for a record batch that has
> > "int32 column1" and "string column2":
> >
> > ArrowSchema {
> >    .format = "+s",
> >    .metadata = {
> >      "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
> >    },
> >    .children = {
> >      ArrowSchema {
> >        .name = "column1",
> >        .format = "i",
> >        .metadata = {
> >          "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >        },
> >      },
> >      ArrowSchema {
> >        .name = "column2",
> >        .format = "u",
> >        .metadata = {
> >          "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >        },
> >      },
> >    },
> > }
> >
> > The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> > => ArrowArray*' is the address of the ArrowArray encoded as a
> > base 10 string, because metadata values can only be strings. You
> > must not release the statistics ArrowArray*. (Its release is a
> > no-op function.) This follows the
> > https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> > semantics: the base ArrowSchema owns the statistics ArrowArray*.
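> >
> > For example, the encoding and decoding could look like this (a
> > sketch with error handling omitted; function names are illustrative
> > only):
> >
> > #include <inttypes.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > struct ArrowArray;  /* from the C data interface */
> >
> > /* Producer side: encode the statistics ArrowArray's address as a
> >  * base 10 string, because metadata values can only be strings. */
> > void encode_statistics(const struct ArrowArray* statistics,
> >                        char* out, size_t out_size) {
> >   snprintf(out, out_size, "%" PRIuPTR, (uintptr_t)statistics);
> > }
> >
> > /* Consumer side: parse the address back into a pointer. Don't call
> >  * release() on it; the base ArrowSchema owns the statistics. */
> > struct ArrowArray* decode_statistics(const char* metadata_value) {
> >   return (struct ArrowArray*)(uintptr_t)strtoull(metadata_value, NULL, 10);
> > }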
> >
> >
> > The ArrowArray* for statistics uses the following schema:
> >
> > | Field Name     | Field Type                       | Comments |
> > |----------------|----------------------------------| -------- |
> > | key            | string not null                  | (1)      |
> > | value          | `VALUE_SCHEMA` not null          |          |
> > | is_approximate | bool not null                    | (2)      |
> >
> > 1. We'll provide pre-defined keys such as "max", "min",
> >     "byte_width" and "distinct_count" but users can also use
> >     application specific keys.
> >
> > 2. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type                       | Comments |
> > |------------|----------------------------------| -------- |
> > | int64      | int64                            |          |
> > | uint64     | uint64                           |          |
> > | float64    | float64                          |          |
> > | value      | The same type as the ArrowSchema | (3)      |
> > |            | it belongs to                    |          |
> >
> > 3. If the ArrowSchema's type is string, this type is also string.
> >
> >     TODO: Is "value" a good name? If we refer to it from the
> >     top-level statistics schema, we need to use
> >     "value.value", which is a bit strange...
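> >
> > Putting the two tables together, the statistics ArrowArray for the
> > "column1" (int32) example above would look like this:
> >
> > ArrowSchema {
> >   .format = "+s",
> >   .children = {
> >     ArrowSchema { .name = "key", .format = "u" },  /* string */
> >     ArrowSchema {
> >       .name = "value",
> >       .format = "+ud:0,1,2,3",  /* dense union */
> >       .children = {
> >         ArrowSchema { .name = "int64",   .format = "l" },
> >         ArrowSchema { .name = "uint64",  .format = "L" },
> >         ArrowSchema { .name = "float64", .format = "g" },
> >         ArrowSchema { .name = "value",   .format = "i" },  /* same as column1 */
> >       },
> >     },
> >     ArrowSchema { .name = "is_approximate", .format = "b" },  /* bool */
> >   },
> > }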
> >
> >
> > What do you think about this proposal? Could you share your
> > comments?
> >
> >
> > Thanks,
