Hi,

You can promise that well-known int32 statistic keys won't ever be higher
than a certain value (2^18) [1] like TCP IP ports (well-known ports in [0,
2^10)) but for non-standard statistics from open-source products the key=0
combined with string label is the way to go, otherwise collisions would be
inevitable and everyone would hate us for having integer keys.

This is not a very weird proposal from my part because integer keys
representing labels are common in most low-level standardized C APIs (e.g.
linux syscalls, ioctls, OpenGL, Vulcan...). I expect higher level APIs to
map all these keys to strings, but with them we keep the "C Data Interface"
low-level and portable as it should be.

--
Felipe

[1] 2^16 is too small. For instance, OpenGL constants can't be enums
because C limits enum to 2^16 and that is *not enough*.

On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Here is an updated summary so far:
>
> ----
> Use cases:
>
> * Optimize query plan: e.g. JOIN for DuckDB
>
> Out of scope:
>
> * Transmit statistics through not the C data interface
>   Examples:
>   * Transmit statistics through Apache Arrow IPC file
>   * Transmit statistics through Apache Arrow Flight
> * Multi-column statistics
> * Constraints information
> * Indexes information
>
> Discussing approach:
>
> Standardize Apache Arrow schema for statistics and transmit
> statistics via separated API call that uses the C data
> interface.
>
> This also works for per-batch statistics.
>
> Candidate schema:
>
> map<
>   // The column index or null if the statistics refer to whole table or
> batch.
>   column: int32,
>   // Statistics key is int32.
>   // Different keys are assigned for exact value and
>   // approximate value.
>   map<int32, dense_union<...needed types based on stat kinds in the
> keys...>>
> >
>
> Discussions:
>
> 1. Can we use int32 for statistic keys?
>    Should we use utf8 (or dictionary<int32, utf8>) for
>    statistic keys?
> 2. Hot to support non-standard (vendor-specific)
>    statistic keys?
> ----
>
> Here is my idea:
>
> 1. We can use int32 for statistic keys.
> 2. We can reserve a specific range for non-standard
>    statistic keys. Prerequisites of this:
>    * There is no use case to merge some statistics for
>      the same data.
>    * We can't merge statistics for different data.
>
> If the prerequisites aren't satisfied:
>
> 1. We should use utf8 (or dictionary<int32, utf8>) for
>    statistic keys?
> 2. We can use reserved prefix such as "ARROW:"/"arrow." for
>    standard statistic keys or use prefix such as
>    "vendor1:"/"vendor1." for non-standard statistic keys.
>
> Here is Felipe's idea:
> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>
> 1. We can use int32 for statistic keys.
> 2. We can use the special statistic key + a string identifier
>    for non-standard statistic keys.
>
>
> What do you think about this?
>
>
> Thanks,
> --
> kou
>
> In <20240606.182727.1004633558059795207....@clear-code.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun
> 2024 18:27:27 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
>
> > Hi,
> >
> > Thanks for sharing your comments. Here is a summary so far:
> >
> > ----
> >
> > Use cases:
> >
> > * Optimize query plan: e.g. JOIN for DuckDB
> >
> > Out of scope:
> >
> > * Transmit statistics through not the C data interface
> >   Examples:
> >   * Transmit statistics through Apache Arrow IPC file
> >   * Transmit statistics through Apache Arrow Flight
> >
> > Candidate approaches:
> >
> > 1. Pass statistics (encoded as an Apache Arrow data) via
> >    ArrowSchema metadata
> >    * This embeds statistics address into metadata
> >    * It's for avoiding using Apache Arrow IPC format with
> >      the C data interface
> > 2. Embed statistics (encoded as an Apache Arrow data) into
> >    ArrowSchema metadata
> >    * This adds statistics to metadata in Apache Arrow IPC
> >      format
> > 3. Embed statistics (encoded as JSON) into ArrowArray
> >    metadata
> > 4. Standardize Apache Arrow schema for statistics and
> >    transmit statistics via separated API call that uses the
> >    C data interface
> > 5. Use ADBC
> >
> > ----
> >
> > I think that 4. is the best approach in these candidates.
> >
> > 1. Embedding statistics address is tricky.
> > 2. Consumers need to parse Apache Arrow IPC format data.
> >    (The C data interface consumers may not have the
> >    feature.)
> > 3. This will work but 4. is more generic.
> > 5. ADBC is too large to use only for statistics.
> >
> > What do you think about this?
> >
> >
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > ----
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|--------- |
> > | ARROW::statistics::version | 1.0.0 | (1)      |
> >
> > (1) This follows semantic versioning.
> >
> > Fields:
> >
> > | Name           | Type                  | Comments |
> > |----------------|-----------------------| -------- |
> > | column         | utf8                  | (2)      |
> > | key            | utf8 not null         | (3)      |
> > | value          | VALUE_SCHEMA not null |          |
> > | is_approximate | bool not null         | (4)      |
> >
> > (2) If null, then the statistic applies to the entire table.
> >     It's for "row_count".
> > (3) We'll provide pre-defined keys such as "max", "min",
> >     "byte_width" and "distinct_count" but users can also use
> >     application specific keys.
> > (4) If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Name    | Type    |
> > |---------|---------|
> > | int64   | int64   |
> > | uint64  | uint64  |
> > | float64 | float64 |
> > | binary  | binary  |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for a int32 column) instead.
> > ----
> >
> >
> > Thanks,
> > --
> > kou
> >
> >
> > In <20240522.113708.2023905028549001143....@clear-code.com>
> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May
> 2024 11:37:08 +0900 (JST),
> >   Sutou Kouhei <k...@clear-code.com> wrote:
> >
> >> Hi,
> >>
> >> We're discussing how to provide statistics through the C
> >> data interface at:
> >> https://github.com/apache/arrow/issues/38837
> >>
> >> If you're interested in this feature, could you share your
> >> comments?
> >>
> >>
> >> Motivation:
> >>
> >> We can interchange Apache Arrow data by the C data interface
> >> in the same process. For example, we can pass Apache Arrow
> >> data read by Apache Arrow C++ (provider) to DuckDB
> >> (consumer) through the C data interface.
> >>
> >> A provider may know Apache Arrow data statistics. For
> >> example, a provider can know statistics when it reads Apache
> >> Parquet data because Apache Parquet may provide statistics.
> >>
> >> But a consumer can't know statistics that are known by a
> >> producer. Because there isn't a standard way to provide
> >> statistics through the C data interface. If a consumer can
> >> know statistics, it can process Apache Arrow data faster
> >> based on statistics.
> >>
> >>
> >> Proposal:
> >>
> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >>
> >> How about providing statistics as a metadata in ArrowSchema?
> >>
> >> We reserve "ARROW" namespace for internal Apache Arrow use:
> >>
> >>
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >>
> >>> The ARROW pattern is a reserved namespace for internal
> >>> Arrow use in the custom_metadata fields. For example,
> >>> ARROW:extension:name.
> >>
> >> So we can use "ARROW:statistics" for the metadata key.
> >>
> >> We can represent statistics as a ArrowArray like ADBC does.
> >>
> >> Here is an example ArrowSchema that is for a record batch
> >> that has "int32 column1" and "string column2":
> >>
> >> ArrowSchema {
> >>   .format = "+siu",
> >>   .metadata = {
> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such
> as row count */
> >>   },
> >>   .children = {
> >>     ArrowSchema {
> >>       .name = "column1",
> >>       .format = "i",
> >>       .metadata = {
> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
> such as count distinct */
> >>       },
> >>     },
> >>     ArrowSchema {
> >>       .name = "column2",
> >>       .format = "u",
> >>       .metadata = {
> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
> such as count distinct */
> >>       },
> >>     },
> >>   },
> >> }
> >>
> >> The metadata value (ArrowArray* part) of '"ARROW:statistics"
> >> => ArrowArray*' is a base 10 string of the address of the
> >> ArrowArray. Because we can use only string for metadata
> >> value. You can't release the statistics ArrowArray*. (Its
> >> release is a no-op function.) It follows
> >>
> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> >> semantics. (The base ArrowSchema owns statistics
> >> ArrowArray*.)
> >>
> >>
> >> ArrowArray* for statistics use the following schema:
> >>
> >> | Field Name     | Field Type                       | Comments |
> >> |----------------|----------------------------------| -------- |
> >> | key            | string not null                  | (1)      |
> >> | value          | `VALUE_SCHEMA` not null          |          |
> >> | is_approximate | bool not null                    | (2)      |
> >>
> >> 1. We'll provide pre-defined keys such as "max", "min",
> >>    "byte_width" and "distinct_count" but users can also use
> >>    application specific keys.
> >>
> >> 2. If true, then the value is approximate or best-effort.
> >>
> >> VALUE_SCHEMA is a dense union with members:
> >>
> >> | Field Name | Field Type                       | Comments |
> >> |------------|----------------------------------| -------- |
> >> | int64      | int64                            |          |
> >> | uint64     | uint64                           |          |
> >> | float64    | float64                          |          |
> >> | value      | The same type of the ArrowSchema | (3)      |
> >> |            | that is belonged to.             |          |
> >>
> >> 3. If the ArrowSchema's type is string, this type is also string.
> >>
> >>    TODO: Is "value" good name? If we refer it from the
> >>    top-level statistics schema, we need to use
> >>    "value.value". It's a bit strange...
> >>
> >>
> >> What do you think about this proposal? Could you share your
> >> comments?
> >>
> >>
> >> Thanks,
> >> --
> >> kou
>

Reply via email to