Hi,
> for non-standard statistics from open-source products the key=0
> combined with string label is the way to go
Where do we store the string label?
I think that we're considering the following schema:
>> map<
>>   // The column index, or null if the statistics refer to the
>>   // whole table or batch.
>>   column: int32,
>>   // Statistics key is int32.
>>   // Different keys are assigned for exact value and
>>   // approximate value.
>>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>> >
Are you considering the following schema for the key=0 case?

map<struct<int32, utf8>,
    dense_union<...needed types based on stat kinds in the keys...>>
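
For illustration, here is a minimal Arrow C++ sketch of that
composite-key variant as I understand it. The field names ("key",
"name") and the union members are placeholders of mine, not part of
any proposal:

  // Hypothetical: a map keyed by struct<key: int32, name: utf8>,
  // where key == 0 marks a non-standard statistic identified by "name".
  #include <arrow/api.h>

  std::shared_ptr<arrow::DataType> MakeStatisticsType() {
    auto key_type = arrow::struct_({
        arrow::field("key", arrow::int32(), /*nullable=*/false),
        // Only meaningful when key == 0.
        arrow::field("name", arrow::utf8()),
    });
    auto value_type = arrow::dense_union({
        arrow::field("int64", arrow::int64()),
        arrow::field("uint64", arrow::uint64()),
        arrow::field("float64", arrow::float64()),
        arrow::field("binary", arrow::binary()),
    });
    return arrow::map(key_type, value_type);
  }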
Thanks,
--
kou
In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
"Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024
11:58:44 -0300,
Felipe Oliveira Carvalho <[email protected]> wrote:
> Hi,
>
> You can promise that well-known int32 statistic keys won't ever be higher
> than a certain value (2^18) [1], like TCP/IP ports (well-known ports in [0,
> 2^10)), but for non-standard statistics from open-source products the key=0
> combined with string label is the way to go, otherwise collisions would be
> inevitable and everyone would hate us for having integer keys.
>
> This is not a very weird proposal on my part, because integer keys
> representing labels are common in most low-level standardized C APIs (e.g.
> Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher-level APIs to
> map all these keys to strings, but with integer keys we keep the "C Data
> Interface" low-level and portable, as it should be.
>
> --
> Felipe
>
> [1] 2^16 is too small. For instance, OpenGL constants can't be enums
> because C limits enum to 2^16 and that is *not enough*.
>
> On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <[email protected]> wrote:
>
>> Hi,
>>
>> Here is an updated summary so far:
>>
>> ----
>> Use cases:
>>
>> * Optimize query plan: e.g. JOIN for DuckDB
>>
>> Out of scope:
>>
>> * Transmit statistics through anything other than the C data interface
>> Examples:
>> * Transmit statistics through Apache Arrow IPC file
>> * Transmit statistics through Apache Arrow Flight
>> * Multi-column statistics
>> * Constraints information
>> * Indexes information
>>
>> Approach under discussion:
>>
>> Standardize an Apache Arrow schema for statistics and transmit
>> statistics via a separate API call that uses the C data
>> interface.
>>
>> This also works for per-batch statistics.
>>
>> Candidate schema:
>>
>> map<
>>   // The column index, or null if the statistics refer to the
>>   // whole table or batch.
>>   column: int32,
>>   // Statistics key is int32.
>>   // Different keys are assigned for exact value and
>>   // approximate value.
>>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>> >
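
For illustration, a minimal Arrow C++ sketch of this candidate schema
as I read it (the union members are placeholders; note that Arrow map
keys may not be null, so the "null column" case would need separate
handling):

  #include <arrow/api.h>

  std::shared_ptr<arrow::DataType> MakeCandidateStatisticsType() {
    auto value_type = arrow::dense_union({
        arrow::field("int64", arrow::int64()),
        arrow::field("uint64", arrow::uint64()),
        arrow::field("float64", arrow::float64()),
        arrow::field("binary", arrow::binary()),
    });
    // Inner map: statistic key (int32) -> statistic value.
    auto per_column_type = arrow::map(arrow::int32(), value_type);
    // Outer map: column index -> per-column statistics.
    return arrow::map(arrow::int32(), per_column_type);
  }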
>>
>> Discussions:
>>
>> 1. Can we use int32 for statistic keys?
>> Should we use utf8 (or dictionary<int32, utf8>) for
>> statistic keys?
>> 2. How to support non-standard (vendor-specific)
>> statistic keys?
>> ----
>>
>> Here is my idea:
>>
>> 1. We can use int32 for statistic keys.
>> 2. We can reserve a specific range for non-standard
>> statistic keys. Prerequisites of this:
>> * There is no use case to merge some statistics for
>> the same data.
>> * We can't merge statistics for different data.
>>
>> If the prerequisites aren't satisfied:
>>
>> 1. We should use utf8 (or dictionary<int32, utf8>) for
>> statistic keys.
>> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
>> standard statistic keys and a prefix such as
>> "vendor1:"/"vendor1." for non-standard statistic keys.
>>
>> Here is Felipe's idea:
>> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>>
>> 1. We can use int32 for statistic keys.
>> 2. We can use the special statistic key + a string identifier
>> for non-standard statistic keys.
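
(For example, a non-standard statistic might be expressed as the
special key plus an identifier string such as "vendor1.max"; the
name is illustrative.)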
>>
>>
>> What do you think about this?
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <[email protected]>
>> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun
>> 2024 18:27:27 +0900 (JST),
>> Sutou Kouhei <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Thanks for sharing your comments. Here is a summary so far:
>> >
>> > ----
>> >
>> > Use cases:
>> >
>> > * Optimize query plan: e.g. JOIN for DuckDB
>> >
>> > Out of scope:
>> >
>> > * Transmit statistics through anything other than the C data interface
>> > Examples:
>> > * Transmit statistics through Apache Arrow IPC file
>> > * Transmit statistics through Apache Arrow Flight
>> >
>> > Candidate approaches:
>> >
>> > 1. Pass statistics (encoded as Apache Arrow data) via
>> > ArrowSchema metadata
>> > * This embeds the statistics address into the metadata
>> > * This avoids using the Apache Arrow IPC format with
>> > the C data interface
>> > 2. Embed statistics (encoded as Apache Arrow data) into
>> > ArrowSchema metadata
>> > * This adds statistics to metadata in Apache Arrow IPC
>> > format
>> > 3. Embed statistics (encoded as JSON) into ArrowSchema
>> > metadata
>> > 4. Standardize an Apache Arrow schema for statistics and
>> > transmit statistics via a separate API call that uses the
>> > C data interface
>> > 5. Use ADBC
>> >
>> > ----
>> >
>> > I think that 4. is the best approach among these candidates.
>> >
>> > 1. Embedding the statistics address is tricky.
>> > 2. Consumers need to parse Apache Arrow IPC format data.
>> > (C data interface consumers may not have that
>> > capability.)
>> > 3. This will work, but 4. is more generic.
>> > 5. ADBC is too large to use only for statistics.
>> >
>> > What do you think about this?
>> >
>> >
>> > If we select 4., we need to standardize an Apache Arrow
>> > schema for statistics. How about the following schema?
>> >
>> > ----
>> > Metadata:
>> >
>> > | Name                     | Value | Comments |
>> > |--------------------------|-------|----------|
>> > | ARROW:statistics:version | 1.0.0 | (1)      |
>> >
>> > (1) This follows semantic versioning.
>> >
>> > Fields:
>> >
>> > | Name           | Type                  | Comments |
>> > |----------------|-----------------------|----------|
>> > | column         | utf8                  | (2)      |
>> > | key            | utf8 not null         | (3)      |
>> > | value          | VALUE_SCHEMA not null |          |
>> > | is_approximate | bool not null         | (4)      |
>> >
>> > (2) If null, then the statistic applies to the entire table,
>> > e.g. for "row_count".
>> > (3) We'll provide pre-defined keys such as "max", "min",
>> > "byte_width" and "distinct_count", but users can also use
>> > application-specific keys.
>> > (4) If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Name | Type |
>> > |---------|---------|
>> > | int64 | int64 |
>> > | uint64 | uint64 |
>> > | float64 | float64 |
>> > | binary | binary |
>> >
>> > If a column is an int32 column, it uses int64 for
>> > "max"/"min". We don't provide all types here. Users should
>> > use a compatible type (int64 for an int32 column) instead.
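
(For example, the exact "max" of an int32 column holding {1, 5, 3}
would be stored as the int64 value 5.)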
>> > ----
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> >
>> > In <[email protected]>
>> > "[DISCUSS] Statistics through the C data interface" on Wed, 22 May
>> 2024 11:37:08 +0900 (JST),
>> > Sutou Kouhei <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> We're discussing how to provide statistics through the C
>> >> data interface at:
>> >> https://github.com/apache/arrow/issues/38837
>> >>
>> >> If you're interested in this feature, could you share your
>> >> comments?
>> >>
>> >>
>> >> Motivation:
>> >>
>> >> We can interchange Apache Arrow data through the C data
>> >> interface within the same process. For example, we can pass
>> >> Apache Arrow data read by Apache Arrow C++ (provider) to
>> >> DuckDB (consumer) through the C data interface.
>> >>
>> >> A provider may know statistics about its Apache Arrow data.
>> >> For example, a provider that reads Apache Parquet data can
>> >> know statistics because Apache Parquet may provide them.
>> >>
>> >> But a consumer can't know the statistics that are known by a
>> >> provider, because there isn't a standard way to provide
>> >> statistics through the C data interface. If a consumer could
>> >> know the statistics, it could process Apache Arrow data
>> >> faster based on them.
>> >>
>> >>
>> >> Proposal:
>> >>
>> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> >>
>> >> How about providing statistics as metadata in ArrowSchema?
>> >>
>> >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> >>
>> >>
>> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >>
>> >>> The ARROW pattern is a reserved namespace for internal
>> >>> Arrow use in the custom_metadata fields. For example,
>> >>> ARROW:extension:name.
>> >>
>> >> So we can use "ARROW:statistics" for the metadata key.
>> >>
>> >> We can represent statistics as an ArrowArray like ADBC does.
>> >>
>> >> Here is an example ArrowSchema for a record batch that has
>> >> "int32 column1" and "string column2":
>> >>
>> >> ArrowSchema {
>> >>   .format = "+s",
>> >>   .metadata = {
>> >>     /* table-level statistics such as row count */
>> >>     "ARROW:statistics" => ArrowArray*,
>> >>   },
>> >>   .children = {
>> >>     ArrowSchema {
>> >>       .name = "column1",
>> >>       .format = "i",
>> >>       .metadata = {
>> >>         /* column-level statistics such as count distinct */
>> >>         "ARROW:statistics" => ArrowArray*,
>> >>       },
>> >>     },
>> >>     ArrowSchema {
>> >>       .name = "column2",
>> >>       .format = "u",
>> >>       .metadata = {
>> >>         /* column-level statistics such as count distinct */
>> >>         "ARROW:statistics" => ArrowArray*,
>> >>       },
>> >>     },
>> >>   },
>> >> }
>> >>
>> >> The metadata value (the ArrowArray* part) of
>> >> '"ARROW:statistics" => ArrowArray*' is a base-10 string of
>> >> the address of the ArrowArray, because we can only use
>> >> strings for metadata values. You can't release the
>> >> statistics ArrowArray*. (Its release is a no-op function.)
>> >> It follows the
>> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> >> semantics. (The base ArrowSchema owns the statistics
>> >> ArrowArray*.)
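
For illustration, here is a minimal C++ sketch of the consumer side
of this mechanism, assuming the metadata value has been copied into
a NUL-terminated buffer (C data interface metadata values carry
explicit lengths); the function name is mine and error handling is
omitted:

  #include <cstdint>
  #include <cstdlib>

  #include "arrow/c/abi.h"  // struct ArrowArray

  // Parse the "ARROW:statistics" metadata value (a base-10 address
  // string) back into the ArrowArray pointer. Don't call the returned
  // array's release callback yourself: it is a no-op, and the base
  // ArrowSchema owns the array.
  struct ArrowArray* DecodeStatisticsAddress(const char* metadata_value) {
    auto address = static_cast<uintptr_t>(
        std::strtoull(metadata_value, /*str_end=*/nullptr, /*base=*/10));
    return reinterpret_cast<struct ArrowArray*>(address);
  }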
>> >>
>> >>
>> >> ArrowArray* for statistics uses the following schema:
>> >>
>> >> | Field Name     | Field Type              | Comments |
>> >> |----------------|-------------------------|----------|
>> >> | key            | string not null         | (1)      |
>> >> | value          | `VALUE_SCHEMA` not null |          |
>> >> | is_approximate | bool not null           | (2)      |
>> >>
>> >> 1. We'll provide pre-defined keys such as "max", "min",
>> >> "byte_width" and "distinct_count", but users can also use
>> >> application-specific keys.
>> >>
>> >> 2. If true, then the value is approximate or best-effort.
>> >>
>> >> VALUE_SCHEMA is a dense union with members:
>> >>
>> >> | Field Name | Field Type                       | Comments |
>> >> |------------|----------------------------------|----------|
>> >> | int64      | int64                            |          |
>> >> | uint64     | uint64                           |          |
>> >> | float64    | float64                          |          |
>> >> | value      | The same type as the ArrowSchema | (3)      |
>> >> |            | it belongs to.                   |          |
>> >>
>> >> 3. If the ArrowSchema's type is string, this type is also string.
>> >>
>> >> TODO: Is "value" a good name? If we refer to it from the
>> >> top-level statistics schema, we need to use "value.value",
>> >> which is a bit strange...
>> >>
>> >>
>> >> What do you think about this proposal? Could you share your
>> >> comments?
>> >>
>> >>
>> >> Thanks,
>> >> --
>> >> kou
>>