Hi,

> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.

That's a good point; I hadn't noticed it. The proposed
approach makes otherwise-identical schemas compare as
different, because the metadata stores the addresses of the
statistics ArrowArrays, not their contents.
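
To make the problem concrete, here is a minimal sketch (the
addresses are made up and ArrowArray is left opaque; this is not
real Arrow code) of how the proposal encodes the address and why
a byte-wise metadata comparison then fails:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct ArrowArray;  /* opaque here; the real struct is defined by
                       the C data interface */

/* The proposal stores the statistics ArrowArray's address as a
   base-10 string because metadata values must be strings. */
static void
encode_statistics_address(const struct ArrowArray *statistics,
                          char *buffer, size_t buffer_size)
{
  snprintf(buffer, buffer_size, "%" PRIuPTR, (uintptr_t)statistics);
}

int
main(void)
{
  /* Two batches with the same logical schema but statistics
     arrays at different (made-up) addresses. */
  char value1[32];
  char value2[32];
  encode_statistics_address((struct ArrowArray *)0x7f0000001000,
                            value1, sizeof(value1));
  encode_statistics_address((struct ArrowArray *)0x7f0000002000,
                            value2, sizeof(value2));
  /* A system that includes custom metadata in schema equivalence
     sees two different "ARROW:statistics" values here. */
  printf("metadata equal: %s\n",
         strcmp(value1, value2) == 0 ? "true" : "false");
  return 0;
}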

> However, I think the bigger question is what the intended use-case for
> these statistics is?

Ah, sorry. I should have explained that in the initial
e-mail as well. The intended use case is query planning. See
the DuckDB use case in the issue:

https://github.com/apache/arrow/issues/38837#issuecomment-2031914284
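
For example, here is a rough consumer-side sketch (not DuckDB's
actual code; the struct and function names are made up) of how
exact min/max statistics let a planner skip a record batch:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded statistics for one int64 column. */
struct ColumnStatistics {
  int64_t min;
  int64_t max;
  bool is_approximate;
};

/* For a filter like "WHERE x = needle": if exact min/max
   statistics exclude the needle, the whole record batch can be
   skipped without scanning it. Approximate statistics are
   ignored to stay conservative. */
static bool
can_skip_batch(const struct ColumnStatistics *stats, int64_t needle)
{
  if (stats->is_approximate) {
    return false;
  }
  return needle < stats->min || needle > stats->max;
}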

>                                                              I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?

Yes, that's a candidate approach. I also considered a
similar approach based on ADBC. It's a bit more abstract
than the separate arrays/grouped StructArray approach
because it can represent not only pre-defined statistics
(min, max, etc.) but also application-specific statistics:
https://github.com/apache/arrow/issues/38837#issuecomment-2074371230
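
To illustrate, here are hypothetical rows in the proposed
key/value/is_approximate layout (see the statistics schema in my
quoted proposal below; the key spellings and the
application-specific key are invented for this example):

#include <stdbool.h>
#include <stdint.h>

/* Flattened view of rows in the proposed statistics array. The
   real "value" field is a dense union; only its int64 member is
   shown here. */
struct StatisticsRow {
  const char *key;
  int64_t int64_value;
  bool is_approximate;
};

static const struct StatisticsRow example_statistics[] = {
  {"row_count",       1000000, false}, /* pre-defined */
  {"distinct_count",    12345, true},  /* pre-defined, estimated */
  {"myapp.page_count",     42, false}, /* application-specific */
};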

But we can't provide a cross-language API with that
approach. Apache Arrow C++ would provide
arrow::RecordBatch::statistics() or something similar, but
Apache Arrow Rust would use a different API. That's why I
proposed the approach based on the C data interface.

With the ADBC-like approach or the grouped StructArray
approach, we would define only the schema for a statistics
record batch. Is that sufficient for sharing statistics widely?


>                    FWIW this is the approach taken by DataFusion for
> pruning statistics [1],

Thanks for sharing this. It seems that DataFusion supports
only pre-defined statistics. Does DataFusion ever require
application-specific statistics? I'm not sure whether we
should support application-specific statistics or not.


Thanks,
-- 
kou


In <fc7ae05d-f420-4c74-8c01-b77e9a256...@googlemail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
12:14:40 +0100,
  Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Hi,
> 
> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.
> 
> However, I think the bigger question is what the intended use-case for
> these statistics is? Often query engines want to collect statistics
> from multiple containers in one go, as this allows for efficient
> vectorised pruning across multiple files, row groups, etc... I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?
> 
> This would have the benefit of not needing specification changes,
> whilst being significantly more efficient than an approach centered on
> scalar statistics. FWIW this is the approach taken by DataFusion for
> pruning statistics [1], and in arrow-rs we represent scalars as arrays
> to avoid needing to define a parallel serialization standard [2].
> 
> Kind Regards,
> 
> Raphael
> 
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
> 
> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> Hi,
>>
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>>
>> If you're interested in this feature, could you share your
>> comments?
>>
>>
>> Motivation:
>>
>> We can interchange Apache Arrow data via the C data interface
>> within the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (producer) to DuckDB (consumer)
>> through the C data interface.
>>
>> A producer may know statistics about the Apache Arrow data. For
>> example, a producer that reads Apache Parquet data can know its
>> statistics, because Apache Parquet files may include them.
>>
>> But a consumer can't access the statistics known by the
>> producer, because there is no standard way to provide statistics
>> through the C data interface. If a consumer could access those
>> statistics, it could process the Apache Arrow data faster based
>> on them.
>>
>>
>> Proposal:
>>
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>
>> How about providing statistics as metadata in ArrowSchema?
>>
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>>
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>> So we can use "ARROW:statistics" for the metadata key.
>>
>> We can represent statistics as an ArrowArray, like ADBC does.
>>
>> Here is an example ArrowSchema for a record batch that has
>> "int32 column1" and "string column2":
>>
>> ArrowSchema {
>>    .format = "+siu",
>>    .metadata = {
>>      /* table-level statistics such as row count */
>>      "ARROW:statistics" => ArrowArray*,
>>    },
>>    .children = {
>>      ArrowSchema {
>>        .name = "column1",
>>        .format = "i",
>>        .metadata = {
>>          /* column-level statistics such as count distinct */
>>          "ARROW:statistics" => ArrowArray*,
>>        },
>>      },
>>      ArrowSchema {
>>        .name = "column2",
>>        .format = "u",
>>        .metadata = {
>>          /* column-level statistics such as count distinct */
>>          "ARROW:statistics" => ArrowArray*,
>>        },
>>      },
>>    },
>> }
>>
>> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is the address of the ArrowArray as a base-10
>> string, because metadata values can only be strings. You must not
>> release this statistics ArrowArray*. (Its release is a no-op
>> function.) It follows the
>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> semantics: the base ArrowSchema owns the statistics ArrowArray*.
>>
>>
>> The ArrowArray* for statistics uses the following schema:
>>
>> | Field Name     | Field Type              | Comments |
>> |----------------|-------------------------|----------|
>> | key            | string not null         | (1)      |
>> | value          | `VALUE_SCHEMA` not null |          |
>> | is_approximate | bool not null           | (2)      |
>>
>> 1. We'll provide pre-defined keys such as "max", "min",
>>     "byte_width" and "distinct_count", but users can also use
>>     application-specific keys.
>>
>> 2. If true, then the value is approximate or best-effort.
>>
>> VALUE_SCHEMA is a dense union with members:
>>
>> | Field Name | Field Type                        | Comments |
>> |------------|-----------------------------------|----------|
>> | int64      | int64                             |          |
>> | uint64     | uint64                            |          |
>> | float64    | float64                           |          |
>> | value      | The same type as the ArrowSchema  | (3)      |
>> |            | this statistics array belongs to. |          |
>>
>> 3. If the ArrowSchema's type is string, this type is also string.
>>
>>     TODO: Is "value" a good name? If we refer to it from the
>>     top-level statistics schema, we need to use "value.value".
>>     It's a bit strange...
>>
>>
>> What do you think about this proposal? Could you share your
>> comments?
>>
>>
>> Thanks,
