Thanks for addressing the feedback! I didn't know that an
Arrow IPC `Message` (not just Schema) could also contain
`custom_metadata` -- thanks for pointing it out.

> Based on the list, how about standardizing both of the
> followings for statistics?
>
> 1. Apache Arrow schema for statistics that is used by
>    separated statistics getter API
> 2. "ARROW:statistics" metadata format that can be used in
>    Apache Arrow schema metadata
>
> Users can use 1. and/or 2. based on their use cases.

This sounds good to me. Using JSON to represent the metadata
for #2 also sounds reasonable. I think elsewhere on this
thread, Weston mentioned that we could alternatively reuse
the schema defined for #1 and encode the statistics into the
schema metadata as an Arrow IPC RecordBatch:

> This has been something that has always been desired for the Arrow IPC
> format too.
>
> My preference would be (apologies if this has been mentioned before):
>
> - Agree on how statistics should be encoded into an array (this is not
>   hard, we just have to agree on the field order and the data type for
>   null_count)
> - If you need statistics in the schema then simply encode the 1-row batch
>   into an IPC buffer (using the streaming format) or maybe just an IPC
>   RecordBatch message since the schema is fixed and store those bytes in the
>   schema

This would avoid having to define a separate "schema" for
the JSON metadata, but might be more effort to work with in
certain contexts (e.g. a library that currently only needs the
C data interface would now also have to learn how to parse
Arrow IPC).
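
For concreteness, here is a rough pyarrow sketch of that idea: encode a
one-row statistics batch with the IPC streaming format and stash the bytes
in the schema metadata. The statistics schema used below is just a
placeholder of my own, not the field order or value type that would
actually be standardized:

  # Rough sketch only -- the statistics layout here is a stand-in, not the
  # agreed-upon schema (which would use a dense union for the value).
  import pyarrow as pa

  stats_schema = pa.schema([
      ("key", pa.utf8()),
      ("value", pa.uint64()),            # placeholder for the union type
      ("is_approximate", pa.bool_()),
  ])
  stats_batch = pa.record_batch(
      [pa.array(["row_count"]), pa.array([29], pa.uint64()), pa.array([False])],
      schema=stats_schema,
  )

  # Serialize the one-row batch with the IPC streaming format.
  sink = pa.BufferOutputStream()
  with pa.ipc.new_stream(sink, stats_schema) as writer:
      writer.write_batch(stats_batch)
  stats_bytes = sink.getvalue().to_pybytes()

  # Store the serialized statistics in the data schema's metadata.
  data_schema = pa.schema([("column1", pa.int32()), ("column2", pa.utf8())])
  data_schema = data_schema.with_metadata({b"ARROW:statistics": stats_bytes})

  # A consumer that speaks IPC can recover the statistics batch:
  recovered = pa.ipc.open_stream(
      data_schema.metadata[b"ARROW:statistics"]).read_all()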

If we do go down the JSON route, how about something like
this to avoid defining the keys for all possible statistics up
front:

  Schema {
    custom_metadata: {
      "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, 
\"value_type\": \"uint64\", \"is_approximate\": false } ]"
    }
  }

It's more verbose, but more closely mirrors the Arrow array
schema defined for the statistics getter API (#1). This could
make it easier to translate between the two representations.
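
To make the round trip concrete, here is a small pyarrow/JSON sketch of
what producing and consuming this could look like (the key names and layout
are just the illustration above, nothing standardized):

  # Illustrative only -- key names and JSON layout follow the example
  # above, not any standardized format.
  import json
  import pyarrow as pa

  table_stats = [
      {"key": "row_count", "value": 29,
       "value_type": "uint64", "is_approximate": False},
  ]
  schema = pa.schema([("column1", pa.int32()), ("column2", pa.utf8())])
  schema = schema.with_metadata({"ARROW:statistics": json.dumps(table_stats)})

  # A consumer only needs a JSON parser, no IPC machinery:
  stats = json.loads(schema.metadata[b"ARROW:statistics"])
  assert stats[0]["key"] == "row_count" and stats[0]["value"] == 29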

Thanks,
Shoumyo

From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

>Hi,
>
>> To start, data might be sourced in various manners:
>> 
>> - Arrow IPC files may be mapped from shared memory
>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>> - ADBC drivers may be used to read from databases
>
>Thanks for listing it.
>
>Regarding to the first case:
>
>Using schema metadata may be a reasonable approach because
>the Arrow data will be on the page cache. There is no
>significant read cost. We don't need to read statistics
>before the Arrow data is ready.
>
>But if the Arrow data will not be produced based on
>statistics of the Arrow data, separated statistics get API
>may be better.
>
>Regarding to the second case:
>
>Schema metadata is an approach for it but we can choose
>other approaches for this case. For example, Flight has
>FlightData::app_metadata[1] and Arrow IPC message has
>custom_metadata[2] as Dewey mentioned.
>
>[1] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
>[2] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
>
>Regarding to the third case:
>
>Reader objects will provide statistics. For example,
>parquet::ColumnChunkMetaData::statistics()
>(parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
>will provide statistics.
>
>Regarding to the forth case:
>
>We can use ADBC API.
>
>Based on the list, how about standardizing both of the
>followings for statistics?
>
>1. Apache Arrow schema for statistics that is used by
>   separated statistics getter API
>2. "ARROW:statistics" metadata format that can be used in
>   Apache Arrow schema metadata
>
>Users can use 1. and/or 2. based on their use cases.
>
>Regarding to 2.: How about the following?
>
>This uses Field::custom_metadata[3] and
>Schema::custom_metadata[4].
>
>[3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
>[4] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564
>
>"ARROW:statistics" in Field::custom_metadata represents
>column-level statistics. It uses JSON like we did for
>"ARROW:extension:metadata"[5]. Here is an example:
>
>  Field {
>    custom_metadata: {
>      "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
>    }
>  }
>
>(JSON may not be able to represent complex information but
>is it needed for statistics?)
>
>"ARROW:statistics" in Schema::custom_metadata represents
>table-level statistics. It uses JSON like we did for
>"ARROW:extension:metadata"[5]. Here is an example:
>
>  Schema {
>    custom_metadata: {
>      "ARROW:statistics" => "{\"row_count\": 29}"
>    }
>  }
>
>TODO: Define the JSON content details. For example, we need
>to define keys such as "distinct_count" and "row_count".
>
>[5] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>
>
>Thanks,
>-- 
>kou
>
>In <664f529b0002a8710c430...@message.bloomberg.net>
>  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 14:28:43 -0000,
>  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
>
>> This is a really exciting development, thank you for putting together
>> this proposal!
>> 
>> It looks like this thread and the linked GitHub issue has lots of input
>> from folks who work with Arrow at a low level and have better familiarity
>> with the Arrow specifications than I do, so I'll refrain from commenting
>> on the technicalities of the proposal. I would, however, like to share my
>> perspective as an application developer that heavily uses Arrow at higher
>> levels for composing data systems.
>> 
>> My main concern with the direction of this proposal is that it seems too
>> narrowly focused on what the integration with DuckDB will look like (how
>> the statistics can be fed into DuckDB). In many applications, executing
>> the query is often the "last mile", and it's important to consider where
>> the statistics will actually come from. To start, data might be sourced
>> in various manners:
>> 
>> - Arrow IPC files may be mapped from shared memory
>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>> - ADBC drivers may be used to read from databases
>> 
>> Note that in at least the first two cases, the system _executing the
>> query_ will not be able to provide statistics simply because it is not
>> actually the data producer. As an example, if Process A writes an Arrow
>> IPC file to shared memory, and Process B wants to run a query on it --
>> how is Process B supposed to get the statistics for query planning? There
>> are a few approaches that I anticipate application developers might
>> consider:
>> 
>> 1. Design an out-of-band mechanism for Process B to fetch statistics from
>>    Process A.
>> 2. Design an encoding that is a superset of Arrow IPC and includes
>>    statistics information, allowing statistics to be communicated in-band.
>> 3. Use custom schema metadata to communicate statistics in-band.
>> 
>> Options 1 and 2 require considerably more effort than Option 3. Also,
>> Option 3 feels somewhat natural because it makes sense for the statistics
>> to come with the data (similar to how statistics are embedded in Parquet
>> files). In some sense, the statistics actually *are* a property of the
>> stream.
>> 
>> In systems that I work on, we already use schema metadata to communicate
>> information that is unrelated to the structure of the data. From my
>> reading of the documentation [1], this sounds like a reasonable (and
>> perhaps intended?) use of metadata, and nowhere is it mentioned that
>> metadata must be used to determine schema equivalence. Unless there are
>> other ways of producing stream-level application metadata outside of the
>> schema/field metadata, the lack of purity was not a concern for me to
>> begin with.
>> 
>> I would appreciate an approach that communicates statistics via schema
>> metadata, or at least in some in-band fashion that is consistent across
>> the IPC and C data specifications. This would make it much easier to
>> uniformly and transparently plumb statistics through applications,
>> regardless of where they source Arrow data from. As developers are likely
>> to create bespoke conventions for this anyways, it seems reasonable to
>> standardize it as canonical metadata.
>> 
>> I say this all as a happy user of DuckDB's Arrow scan functionality that
>> is excited to see better query optimization capabilities. It's just that,
>> in its current form, the changes in this proposal are not something I
>> could foreseeably integrate with.
>> 
>> Best,
>> Shoumyo
>> 
>> [1]: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> 
>> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
>> Subject: Re: [DISCUSS] Statistics through the C data interface
>> 
>> I want to +1 on what Dewey is saying here and some comments.
>> 
>> Sutou Kouhei wrote:
>>> ADBC may be a bit larger to use only for transmitting statistics. ADBC
>>> has statistics related APIs but it has more other APIs.
>> 
>> It's impossible to keep the responsibility of communication protocols
>> cleanly separated, but IMO, we should strive to keep the C Data
>> Interface more of a Transport Protocol than an Application Protocol.
>> 
>> Statistics are application dependent and can complicate the
>> implementation of importers/exporters which would hinder the adoption
>> of the C Data Interface. Statistics also bring in security concerns
>> that are application-specific. e.g. can an algorithm trust min/max
>> stats and risk producing incorrect results if the statistics are
>> incorrect? A question that can't really be answered at the C Data
>> Interface level.
>> 
>> The need for more sophisticated statistics only grows with time, so
>> there is no such thing as a "simple statistics schema".
>> 
>> Protocols that produce/consume statistics might want to use the C Data
>> Interface as a primitive for passing Arrow arrays of statistics.
>> 
>> ADBC might be too big of a leap in complexity now, but "we just need C
>> Data Interface + statistics" is unlikely to remain true for very long
>> as projects grow in complexity.
>> 
>> --
>> Felipe
>> 
>> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>> <de...@voltrondata.com.invalid> wrote:
>>>
>>> Thank you for the background! I understand that these statistics are
>>> important for query planning; however, I am not sure that I follow why
>>> we are constrained to the ArrowSchema to represent them. The examples
>>> given seem to going through Python...would it be easier to request
>>> statistics at a higher level of abstraction? There would already need
>>> to be a separate mechanism to request an ArrowArrayStream with
>>> statistics (unless the PyCapsule `requested_schema` argument would
>>> suffice).
>>>
>>> > ADBC may be a bit larger to use only for transmitting
>>> > statistics. ADBC has statistics related APIs but it has more
>>> > other APIs.
>>>
>>> Some examples of producers given in the linked threads (Delta Lake,
>>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>>> can implement an ADBC driver without defining all the methods (where
>>> the producer could call AdbcConnectionGetStatistics(), although
>>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>>> exist). One example listed (using an Arrow Table as a source) seems a
>>> bit light to wrap in an ADBC driver; however, it would not take much
>>> code to do so and the overhead of getting the reader via ADBC it is
>>> something like 100 microseconds (tested via the ADBC R package's
>>> "monkey driver" which wraps an existing stream as a statement). In any
>>> case, the bulk of the code is building the statistics array.
>>>
>>> > How about the following schema for the
>>> > statistics ArrowArray? It's based on ADBC.
>>>
>>> Whatever format for statistics is decided on, I imagine it should be
>>> exactly the same as the ADBC standard? (Perhaps pushing changes
>>> upstream if needed?).
>>>
>>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > > Why not simply pass the statistics ArrowArray separately in your
>>> > > producer API of choice
>>> >
>>> > It seems that we should use the approach because all
>>> > feedback said so. How about the following schema for the
>>> > statistics ArrowArray? It's based on ADBC.
>>> >
>>> > | Field Name               | Field Type            | Comments |
>>> > |--------------------------|-----------------------| -------- |
>>> > | column_name              | utf8                  | (1)      |
>>> > | statistic_key            | utf8 not null         | (2)      |
>>> > | statistic_value          | VALUE_SCHEMA not null |          |
>>> > | statistic_is_approximate | bool not null         | (3)      |
>>> >
>>> > 1. If null, then the statistic applies to the entire table.
>>> >    It's for "row_count".
>>> > 2. We'll provide pre-defined keys such as "max", "min",
>>> >    "byte_width" and "distinct_count" but users can also use
>>> >    application specific keys.
>>> > 3. If true, then the value is approximate or best-effort.
>>> >
>>> > VALUE_SCHEMA is a dense union with members:
>>> >
>>> > | Field Name | Field Type |
>>> > |------------|------------|
>>> > | int64      | int64      |
>>> > | uint64     | uint64     |
>>> > | float64    | float64    |
>>> > | binary     | binary     |
>>> >
>>> > If a column is an int32 column, it uses int64 for
>>> > "max"/"min". We don't provide all types here. Users should
>>> > use a compatible type (int64 for a int32 column) instead.
>>> >
>>> >
>>> > Thanks,
>>> > --
>>> > kou
>>> >
>>> > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>>> >   Antoine Pitrou <anto...@python.org> wrote:
>>> >
>>> > >
>>> > > Hi Kou,
>>> > >
>>> > > I agree that Dewey that this is overstretching the capabilities of the
>>> > > C Data Interface. In particular, stuffing a pointer as metadata value
>>> > > and decreeing it immortal doesn't sound like a good design decision.
>>> > >
>>> > > Why not simply pass the statistics ArrowArray separately in your
>>> > > producer API of choice (Dewey mentioned ADBC but it is of course just
>>> > > a possible API among others)?
>>> > >
>>> > > Regards
>>> > >
>>> > > Antoine.
>>> > >
>>> > >
>>> > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
>>> > >> Hi,
>>> > >> We're discussing how to provide statistics through the C
>>> > >> data interface at:
>>> > >> https://github.com/apache/arrow/issues/38837
>>> > >> If you're interested in this feature, could you share your
>>> > >> comments?
>>> > >> Motivation:
>>> > >> We can interchange Apache Arrow data by the C data interface
>>> > >> in the same process. For example, we can pass Apache Arrow
>>> > >> data read by Apache Arrow C++ (provider) to DuckDB
>>> > >> (consumer) through the C data interface.
>>> > >> A provider may know Apache Arrow data statistics. For
>>> > >> example, a provider can know statistics when it reads Apache
>>> > >> Parquet data because Apache Parquet may provide statistics.
>>> > >> But a consumer can't know statistics that are known by a
>>> > >> producer. Because there isn't a standard way to provide
>>> > >> statistics through the C data interface. If a consumer can
>>> > >> know statistics, it can process Apache Arrow data faster
>>> > >> based on statistics.
>>> > >> Proposal:
>>> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>> > >> How about providing statistics as a metadata in ArrowSchema?
>>> > >> We reserve "ARROW" namespace for internal Apache Arrow use:
>>> > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>> > >>
>>> > >>> The ARROW pattern is a reserved namespace for internal
>>> > >>> Arrow use in the custom_metadata fields. For example,
>>> > >>> ARROW:extension:name.
>>> > >> So we can use "ARROW:statistics" for the metadata key.
>>> > >> We can represent statistics as a ArrowArray like ADBC does.
>>> > >> Here is an example ArrowSchema that is for a record batch
>>> > >> that has "int32 column1" and "string column2":
>>> > >> ArrowSchema {
>>> > >>    .format = "+siu",
>>> > >>    .metadata = {
>>> > >>      "ARROW:statistics" => ArrowArray*, /* table-level statistics such as
>>> > >>      row count */
>>> > >>    },
>>> > >>    .children = {
>>> > >>      ArrowSchema {
>>> > >>        .name = "column1",
>>> > >>        .format = "i",
>>> > >>        .metadata = {
>>> > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics such as
>>> > >>          count distinct */
>>> > >>        },
>>> > >>      },
>>> > >>      ArrowSchema {
>>> > >>        .name = "column2",
>>> > >>        .format = "u",
>>> > >>        .metadata = {
>>> > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics such as
>>> > >>          count distinct */
>>> > >>        },
>>> > >>      },
>>> > >>    },
>>> > >> }
>>> > >> The metadata value (ArrowArray* part) of '"ARROW:statistics"
>>> > >> => ArrowArray*' is a base 10 string of the address of the
>>> > >> ArrowArray. Because we can use only string for metadata
>>> > >> value. You can't release the statistics ArrowArray*. (Its
>>> > >> release is a no-op function.) It follows
>>> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>> > >> semantics. (The base ArrowSchema owns statistics
>>> > >> ArrowArray*.)
>>> > >> ArrowArray* for statistics use the following schema:
>>> > >> | Field Name     | Field Type                       | Comments |
>>> > >> |----------------|----------------------------------| -------- |
>>> > >> | key            | string not null                  | (1)      |
>>> > >> | value          | `VALUE_SCHEMA` not null          |          |
>>> > >> | is_approximate | bool not null                    | (2)      |
>>> > >> 1. We'll provide pre-defined keys such as "max", "min",
>>> > >>     "byte_width" and "distinct_count" but users can also use
>>> > >>     application specific keys.
>>> > >> 2. If true, then the value is approximate or best-effort.
>>> > >> VALUE_SCHEMA is a dense union with members:
>>> > >> | Field Name | Field Type                       | Comments |
>>> > >> |------------|----------------------------------| -------- |
>>> > >> | int64      | int64                            |          |
>>> > >> | uint64     | uint64                           |          |
>>> > >> | float64    | float64                          |          |
>>> > >> | value      | The same type of the ArrowSchema | (3)      |
>>> > >> |            | that is belonged to.             |          |
>>> > >> 3. If the ArrowSchema's type is string, this type is also string.
>>> > >>     TODO: Is "value" good name? If we refer it from the
>>> > >>     top-level statistics schema, we need to use
>>> > >>     "value.value". It's a bit strange...
>>> > >> What do you think about this proposal? Could you share your
>>> > >> comments?
>>> > >> Thanks,