Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Wed, 22 May 2024 23:21:08 -0700

Hi,

> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice


It seems that we should use the approach because all
feedback said so. How about the following schema for the
statistics ArrowArray? It's based on ADBC.

| Field Name               | Field Type            | Comments |
|--------------------------|-----------------------| -------- |
| column_name              | utf8                  | (1)      |
| statistic_key            | utf8 not null         | (2)      |
| statistic_value          | VALUE_SCHEMA not null |          |
| statistic_is_approximate | bool not null         | (3)      |

1. If null, then the statistic applies to the entire table.
   It's for "row_count".
2. We'll provide pre-defined keys such as "max", "min",
   "byte_width" and "distinct_count" but users can also use
   application specific keys.
3. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type |
|------------|------------|
| int64      | int64      |
| uint64     | uint64     |
| float64    | float64    |
| binary     | binary     |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for a int32 column) instead.


Thanks,
-- 
kou

In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
17:04:57 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

> 
> Hi Kou,
> 
> I agree that Dewey that this is overstretching the capabilities of the
> C Data Interface. In particular, stuffing a pointer as metadata value
> and decreeing it immortal doesn't sound like a good design decision.
> 
> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice (Dewey mentioned ADBC but it is of course just
> a possible API among others)?
> 
> Regards
> 
> Antoine.
> 
> 
> Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
>> Hi,
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>> If you're interested in this feature, could you share your
>> comments?
>> Motivation:
>> We can interchange Apache Arrow data by the C data interface
>> in the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (provider) to DuckDB
>> (consumer) through the C data interface.
>> A provider may know Apache Arrow data statistics. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data because Apache Parquet may provide statistics.
>> But a consumer can't know statistics that are known by a
>> producer. Because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer can
>> know statistics, it can process Apache Arrow data faster
>> based on statistics.
>> Proposal:
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> How about providing statistics as a metadata in ArrowSchema?
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> 
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>> So we can use "ARROW:statistics" for the metadata key.
>> We can represent statistics as a ArrowArray like ADBC does.
>> Here is an example ArrowSchema that is for a record batch
>> that has "int32 column1" and "string column2":
>> ArrowSchema {
>>    .format = "+siu",
>>    .metadata = {
>>      "ARROW:statistics" => ArrowArray*, /* table-level statistics such as
>>      row count */
>>    },
>>    .children = {
>>      ArrowSchema {
>>        .name = "column1",
>>        .format = "i",
>>        .metadata = {
>>          "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as
>>          count distinct */
>>        },
>>      },
>>      ArrowSchema {
>>        .name = "column2",
>>        .format = "u",
>>        .metadata = {
>>          "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as
>>          count distinct */
>>        },
>>      },
>>    },
>> }
>> The metadata value (ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is a base 10 string of the address of the
>> ArrowArray. Because we can use only string for metadata
>> value. You can't release the statistics ArrowArray*. (Its
>> release is a no-op function.) It follows
>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> semantics. (The base ArrowSchema owns statistics
>> ArrowArray*.)
>> ArrowArray* for statistics use the following schema:
>> | Field Name     | Field Type                       | Comments |
>> |----------------|----------------------------------| -------- |
>> | key            | string not null                  | (1)      |
>> | value          | `VALUE_SCHEMA` not null          |          |
>> | is_approximate | bool not null                    | (2)      |
>> 1. We'll provide pre-defined keys such as "max", "min",
>>     "byte_width" and "distinct_count" but users can also use
>>     application specific keys.
>> 2. If true, then the value is approximate or best-effort.
>> VALUE_SCHEMA is a dense union with members:
>> | Field Name | Field Type                       | Comments |
>> |------------|----------------------------------| -------- |
>> | int64      | int64                            |          |
>> | uint64     | uint64                           |          |
>> | float64    | float64                          |          |
>> | value      | The same type of the ArrowSchema | (3)      |
>> |            | that is belonged to.             |          |
>> 3. If the ArrowSchema's type is string, this type is also string.
>>     TODO: Is "value" good name? If we refer it from the
>>     top-level statistics schema, we need to use
>>     "value.value". It's a bit strange...
>> What do you think about this proposal? Could you share your
>> comments?
>> Thanks,

Re: [DISCUSS] Statistics through the C data interface

Reply via email to