Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Thu, 20 Jun 2024 02:25:07 -0700

Hi,

Here is an updated summary so far:


----
Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through not the C data interface
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight
* Multi-column statistics
* Constraints information
* Indexes information

Discussing approach:

Standardize Apache Arrow schema for statistics and transmit
statistics via separated API call that uses the C data
interface.

This also works for per-batch statistics.

Candidate schema:

map<
  // The column index or null if the statistics refer to whole table or batch.
  column: int32,
  // Statistics key is int32.
  // Different keys are assigned for exact value and
  // approximate value.
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>

Discussions:

1. Can we use int32 for statistic keys?
   Should we use utf8 (or dictionary<int32, utf8>) for
   statistic keys?
2. Hot to support non-standard (vendor-specific)
   statistic keys?
----

Here is my idea:

1. We can use int32 for statistic keys.
2. We can reserve a specific range for non-standard
   statistic keys. Prerequisites of this:
   * There is no use case to merge some statistics for
     the same data.
   * We can't merge statistics for different data.

If the prerequisites aren't satisfied:

1. We should use utf8 (or dictionary<int32, utf8>) for
   statistic keys?
2. We can use reserved prefix such as "ARROW:"/"arrow." for
   standard statistic keys or use prefix such as
   "vendor1:"/"vendor1." for non-standard statistic keys.

Here is Felipe's idea:
https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2

1. We can use int32 for statistic keys.
2. We can use the special statistic key + a string identifier
   for non-standard statistic keys.


What do you think about this?


Thanks,
-- 
kou

In <20240606.182727.1004633558059795207....@clear-code.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 
18:27:27 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
> Thanks for sharing your comments. Here is a summary so far:
> 
> ----
> 
> Use cases:
> 
> * Optimize query plan: e.g. JOIN for DuckDB
> 
> Out of scope:
> 
> * Transmit statistics through not the C data interface
>   Examples:
>   * Transmit statistics through Apache Arrow IPC file
>   * Transmit statistics through Apache Arrow Flight
> 
> Candidate approaches:
> 
> 1. Pass statistics (encoded as an Apache Arrow data) via
>    ArrowSchema metadata
>    * This embeds statistics address into metadata
>    * It's for avoiding using Apache Arrow IPC format with
>      the C data interface
> 2. Embed statistics (encoded as an Apache Arrow data) into
>    ArrowSchema metadata
>    * This adds statistics to metadata in Apache Arrow IPC
>      format
> 3. Embed statistics (encoded as JSON) into ArrowArray
>    metadata
> 4. Standardize Apache Arrow schema for statistics and
>    transmit statistics via separated API call that uses the
>    C data interface
> 5. Use ADBC
> 
> ----
> 
> I think that 4. is the best approach in these candidates.
> 
> 1. Embedding statistics address is tricky.
> 2. Consumers need to parse Apache Arrow IPC format data.
>    (The C data interface consumers may not have the
>    feature.)
> 3. This will work but 4. is more generic.
> 5. ADBC is too large to use only for statistics.
> 
> What do you think about this?
> 
> 
> If we select 4., we need to standardize Apache Arrow schema
> for statistics. How about the following schema?
> 
> ----
> Metadata:
> 
> | Name                       | Value | Comments |
> |----------------------------|-------|--------- |
> | ARROW::statistics::version | 1.0.0 | (1)      |
> 
> (1) This follows semantic versioning.
> 
> Fields:
> 
> | Name           | Type                  | Comments |
> |----------------|-----------------------| -------- |
> | column         | utf8                  | (2)      |
> | key            | utf8 not null         | (3)      |
> | value          | VALUE_SCHEMA not null |          |
> | is_approximate | bool not null         | (4)      |
> 
> (2) If null, then the statistic applies to the entire table.
>     It's for "row_count".
> (3) We'll provide pre-defined keys such as "max", "min",
>     "byte_width" and "distinct_count" but users can also use
>     application specific keys.
> (4) If true, then the value is approximate or best-effort.
> 
> VALUE_SCHEMA is a dense union with members:
> 
> | Name    | Type    |
> |---------|---------|
> | int64   | int64   |
> | uint64  | uint64  |
> | float64 | float64 |
> | binary  | binary  |
> 
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for a int32 column) instead.
> ----
> 
> 
> Thanks,
> -- 
> kou
> 
> 
> In <20240522.113708.2023905028549001143....@clear-code.com>
>   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
> 11:37:08 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
> 
>> Hi,
>> 
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>> 
>> If you're interested in this feature, could you share your
>> comments?
>> 
>> 
>> Motivation:
>> 
>> We can interchange Apache Arrow data by the C data interface
>> in the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (provider) to DuckDB
>> (consumer) through the C data interface.
>> 
>> A provider may know Apache Arrow data statistics. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data because Apache Parquet may provide statistics.
>> 
>> But a consumer can't know statistics that are known by a
>> producer. Because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer can
>> know statistics, it can process Apache Arrow data faster
>> based on statistics.
>> 
>> 
>> Proposal:
>> 
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> 
>> How about providing statistics as a metadata in ArrowSchema?
>> 
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>> 
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> 
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>> 
>> So we can use "ARROW:statistics" for the metadata key.
>> 
>> We can represent statistics as a ArrowArray like ADBC does.
>> 
>> Here is an example ArrowSchema that is for a record batch
>> that has "int32 column1" and "string column2":
>> 
>> ArrowSchema {
>>   .format = "+siu",
>>   .metadata = {
>>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
>> count */
>>   },
>>   .children = {
>>     ArrowSchema {
>>       .name = "column1",
>>       .format = "i",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as count distinct */
>>       },
>>     },
>>     ArrowSchema {
>>       .name = "column2",
>>       .format = "u",
>>       .metadata = {
>>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as count distinct */
>>       },
>>     },
>>   },
>> }
>> 
>> The metadata value (ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is a base 10 string of the address of the
>> ArrowArray. Because we can use only string for metadata
>> value. You can't release the statistics ArrowArray*. (Its
>> release is a no-op function.) It follows
>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> semantics. (The base ArrowSchema owns statistics
>> ArrowArray*.)
>> 
>> 
>> ArrowArray* for statistics use the following schema:
>> 
>> | Field Name     | Field Type                       | Comments |
>> |----------------|----------------------------------| -------- |
>> | key            | string not null                  | (1)      |
>> | value          | `VALUE_SCHEMA` not null          |          |
>> | is_approximate | bool not null                    | (2)      |
>> 
>> 1. We'll provide pre-defined keys such as "max", "min",
>>    "byte_width" and "distinct_count" but users can also use
>>    application specific keys.
>> 
>> 2. If true, then the value is approximate or best-effort.
>> 
>> VALUE_SCHEMA is a dense union with members:
>> 
>> | Field Name | Field Type                       | Comments |
>> |------------|----------------------------------| -------- |
>> | int64      | int64                            |          |
>> | uint64     | uint64                           |          |
>> | float64    | float64                          |          |
>> | value      | The same type of the ArrowSchema | (3)      |
>> |            | that is belonged to.             |          |
>> 
>> 3. If the ArrowSchema's type is string, this type is also string.
>> 
>>    TODO: Is "value" good name? If we refer it from the
>>    top-level statistics schema, we need to use
>>    "value.value". It's a bit strange...
>> 
>> 
>> What do you think about this proposal? Could you share your
>> comments?
>> 
>> 
>> Thanks,
>> -- 
>> kou

Re: [DISCUSS] Statistics through the C data interface

Reply via email to