Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Felipe Oliveira Carvalho
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

  map<
    // the column index, or null if the statistics refer to the whole table or batch
    column: int32,
    map<int32, dense_union<...>>
  >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;

#define ARROW_STAT_ANY 0
// Exact number of nulls in a column. Value must be int32 or int64.
#define ARROW_STAT_NULL_COUNT_EXACT 1
// Approximate number of nulls in a column. Value must be float32 or float64.
#define ARROW_STAT_NULL_COUNT_APPROX 2
// The minimum and maximum values of a column.
// Value must be the same type as the column.
// Supported types are: ...
#define ARROW_STAT_MIN_APPROX 3
#define ARROW_STAT_MIN_NULLS_FIRST 4
#define ARROW_STAT_MIN_NULLS_LAST 5
#define ARROW_STAT_MAX_APPROX 6
#define ARROW_STAT_MAX_NULLS_FIRST 7
#define ARROW_STAT_MAX_NULLS_LAST 8
#define ARROW_STAT_CARDINALITY_APPROX 9
#define ARROW_STAT_COUNT_DISTINCT_APPROX 10

Every key is optional and consumers that don't know or don't care
about the stats can skip them while scanning statistics arrays.
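As a rough illustration of that skipping behavior, here is a minimal sketch of a consumer that only understands the two null-count kinds and ignores everything else. The `StatEntry` and `NullCountStats` structs and the `collect_null_count` function are hypothetical names made up for this example; a real consumer would read the entries out of Arrow arrays rather than a plain C array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t ArrowStatKind;

#define ARROW_STAT_NULL_COUNT_EXACT 1
#define ARROW_STAT_NULL_COUNT_APPROX 2

/* Hypothetical decoded entry: one (kind, value) pair already read out of
 * a statistics array. */
struct StatEntry {
  ArrowStatKind kind;
  int is_float; /* 1 if value.f64 is set, 0 if value.i64 is set */
  union {
    int64_t i64;
    double f64;
  } value;
};

/* Hypothetical consumer-side domain class. */
struct NullCountStats {
  int known;    /* 1 once any null count has been seen */
  int exact;    /* 1 if the stored count came from the exact kind */
  double count;
};

/* Fills `out` from the entries this consumer understands and returns the
 * number of entries it skipped. */
static size_t collect_null_count(const struct StatEntry* entries, size_t n,
                                 struct NullCountStats* out) {
  size_t skipped = 0;
  assert(out != NULL);
  assert(entries != NULL || n == 0);
  out->known = 0;
  out->exact = 0;
  out->count = 0.0;
  for (size_t i = 0; i < n; i++) {
    const struct StatEntry* e = &entries[i];
    if (e->kind == ARROW_STAT_NULL_COUNT_EXACT && !e->is_float) {
      /* An exact count always wins over an approximate one. */
      out->known = 1;
      out->exact = 1;
      out->count = (double)e->value.i64;
    } else if (e->kind == ARROW_STAT_NULL_COUNT_APPROX && e->is_float) {
      if (!out->exact) {
        out->known = 1;
        out->count = e->value.f64;
      }
    } else {
      /* Unknown kind or unexpected value type: ignore it. */
      skipped++;
    }
  }
  return skipped;
}
```

The consumer never fails on an unrecognized kind, so producers can add new kinds without breaking it.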

Applications would have their own domain classes for storing
statistics (e.g. DuckDB's BaseStatistics [1]) and a way to pack and
unpack into these arrays.

The exact types inside the dense_union would be chosen when encoding.
The decoder would handle the types expected and/or supported for each
given stat kind.

We wouldn't have to rely on versioning of the entire statistics
objects. If we want a richer way to represent a maximum, we add
another stat kind to the spec and keep producing both the old and the
new representations for the maximum while consumers migrate to the new
way. Version markers in two-sided protocols never work well long term:
see Parquet files lying about the version of the encoder so the files
can be read and web browsers lying in their User-Agent strings so
websites don't break. It's better to allow probing for individual
feature support (in this case, the presence of a specific stat kind in
the array).
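A minimal sketch of what such probing could look like on the consumer side, assuming the stat kinds present for a column have already been extracted into a plain C array. `EXAMPLE_STAT_MAX_RICH` is a hypothetical "richer maximum" kind invented for this example; it is not part of any proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t ArrowStatKind;

#define ARROW_STAT_MAX_NULLS_LAST 8
/* Hypothetical newer kind, used only to illustrate migration. */
#define EXAMPLE_STAT_MAX_RICH 1000

/* Returns the index of `wanted` in `kinds`, or -1 if it is absent. */
static ptrdiff_t find_stat(const ArrowStatKind* kinds, size_t n,
                           ArrowStatKind wanted) {
  assert(kinds != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (kinds[i] == wanted) {
      return (ptrdiff_t)i;
    }
  }
  return -1;
}

/* Prefers the newer representation when the producer sent it and falls
 * back to the older one otherwise. No version number is consulted. */
static ptrdiff_t pick_max_stat(const ArrowStatKind* kinds, size_t n) {
  ptrdiff_t i = find_stat(kinds, n, EXAMPLE_STAT_MAX_RICH);
  if (i >= 0) {
    return i;
  }
  return find_stat(kinds, n, ARROW_STAT_MAX_NULLS_LAST);
}
```

An old producer that only emits kind 8 and a new producer that emits both kinds are handled by the same consumer code.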

Statistics could be loaded over multiple calls, and each call could
return more statistics than the previous one.

--
Felipe

[1] 
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146



Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Dewey Dunnington
Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.

> Fields:
>
> | Name   | Type | Comments |
> |--------|------|----------|
> | column | utf8 | (2)      |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?




[RESULT][VOTE][RUST] Release Apache Arrow Rust 52.0.0 RC1

2024-06-06 Thread Raphael Taylor-Davies

With 8 +1 votes (7 binding), including myself, the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-52.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 03/06/2024 17:03, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 52.0.0.


This release candidate is based on commit: 
f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

I vote +1 (binding) on this release

[1]: 
https://github.com/apache/arrow-rs/tree/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Alessandro Molina
I brought it up on GitHub, but I'm writing here too to avoid spawning too many
threads.
https://github.com/apache/arrow/issues/38837#issuecomment-2145343755

It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add per-batch
statistics in ArrowArrayStream.

While it's true that in most cases the producer code will be applying the
filtering, in the case of the C data interface we can't take that for granted. There
might be cases where the consumer has no control over the filtering that
the producer would apply and the producer might not be aware of the
filtering that the consumer might want to do.

In those cases providing the statistics per-batch would allow the consumer
to skip the batches it doesn't care about, thus giving the opportunity for
a fast path.
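To make the fast path concrete, here is a minimal sketch of how a consumer might use per-batch min/max values to decide which batches to scan for a range filter. The `BatchRange` struct and both functions are hypothetical; the sketch assumes the min/max are exact, since approximate values could not safely be used to skip data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-batch statistics for one int64 column. */
struct BatchRange {
  int has_stats; /* 0 if the producer sent no statistics for this batch */
  int64_t min;   /* assumed exact */
  int64_t max;   /* assumed exact */
};

/* Returns 1 if the batch may contain a value in [lo, hi], 0 if it can be
 * skipped. Batches without statistics must always be scanned. */
static int batch_may_match(const struct BatchRange* b, int64_t lo,
                           int64_t hi) {
  assert(b != NULL);
  if (!b->has_stats) {
    return 1;
  }
  return b->max >= lo && b->min <= hi;
}

/* Counts how many batches still need to be scanned for the filter. */
static size_t count_batches_to_scan(const struct BatchRange* batches,
                                    size_t n, int64_t lo, int64_t hi) {
  size_t to_scan = 0;
  assert(batches != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    to_scan += (size_t)batch_may_match(&batches[i], lo, hi);
  }
  return to_scan;
}
```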







Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou



Hi Kou,

Thanks for pushing for this!

On 06/06/2024 at 11:27, Sutou Kouhei wrote:
> 4. Standardize Apache Arrow schema for statistics and
>    transmit statistics via separated API call that uses the
>    C data interface
[...]
>
> I think that 4. is the best approach in these candidates.

I agree.

> If we select 4., we need to standardize Apache Arrow schema
> for statistics. How about the following schema?
>
> Metadata:
>
> | Name                       | Value | Comments |
> |----------------------------|-------|----------|
> | ARROW::statistics::version | 1.0.0 | (1)      |

I'm not sure this is useful, but it doesn't hurt.

Nit: this should be "ARROW:statistics:version" for consistency with
https://arrow.apache.org/docs/format/Columnar.html#extension-types

> Fields:
>
> | Name   | Type          | Comments |
> |--------|---------------|----------|
> | column | utf8          | (2)      |
> | key    | utf8 not null | (3)      |

1. Should the key be something like `dictionary(int32, utf8)` to make
the representation more efficient where there are many columns?

2. Should the statistics perhaps be nested as a map type under each
column to avoid repeating `column`, or is that overkill?

3. Should there also be room for multi-column statistics (such as
cardinality of a given column pair), or is it too complex for now?


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Sutou Kouhei
Hi,

Thanks for sharing your comments. Here is a summary so far:



Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics by means other than the C data interface
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight

Candidate approaches:

1. Pass statistics (encoded as Apache Arrow data) via
   ArrowSchema metadata
   * This embeds the address of the statistics into the metadata
   * This avoids using the Apache Arrow IPC format with
     the C data interface
2. Embed statistics (encoded as Apache Arrow data) into
   ArrowSchema metadata
   * This adds statistics to the metadata in the Apache Arrow IPC
     format
3. Embed statistics (encoded as JSON) into ArrowSchema
   metadata
4. Standardize an Apache Arrow schema for statistics and
   transmit statistics via a separate API call that uses the
   C data interface
5. Use ADBC



I think that 4. is the best approach among these candidates.

1. Embedding the address of the statistics is tricky.
2. Consumers need to parse Apache Arrow IPC format data.
   (The C data interface consumers may not have the
   feature.)
3. This will work but 4. is more generic.
5. ADBC is too large to use only for statistics.

What do you think about this?


If we select 4., we need to standardize an Apache Arrow schema
for statistics. How about the following schema?


Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|----------|
| ARROW::statistics::version | 1.0.0 | (1)      |

(1) This follows semantic versioning.

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------|----------|
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (4)      |

(2) If null, then the statistic applies to the entire table.
It's for "row_count".
(3) We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.
(4) If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Name    | Type    |
|---------|---------|
| int64   | int64   |
| uint64  | uint64  |
| float64 | float64 |
| binary  | binary  |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for an int32 column) instead.
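For illustration only, one row of this schema could be modeled in C as a tagged union, with an int32 column's "max" widened to int64 as described above. The `StatValueType`, `StatValue`, and `StatRecord` names and the `make_int32_max` helper are hypothetical and only sketch the idea; they are not a proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four members of the VALUE_SCHEMA dense union. */
enum StatValueType {
  STAT_VALUE_INT64,
  STAT_VALUE_UINT64,
  STAT_VALUE_FLOAT64,
  STAT_VALUE_BINARY
};

struct StatValue {
  enum StatValueType type; /* which member of `as` is set */
  union {
    int64_t i64;
    uint64_t u64;
    double f64;
    struct {
      const uint8_t* data;
      size_t size;
    } binary;
  } as;
};

/* One row of the statistics schema. */
struct StatRecord {
  const char* column; /* NULL means the statistic applies to the table */
  const char* key;    /* e.g. "max", "min", "distinct_count" */
  struct StatValue value;
  int is_approximate;
};

/* Computes the exact maximum of an int32 column and stores it as int64,
 * because the union has no int32 member. */
static struct StatRecord make_int32_max(const char* column,
                                        const int32_t* values, size_t n) {
  struct StatRecord rec;
  int32_t max;
  assert(values != NULL && n > 0);
  max = values[0];
  for (size_t i = 1; i < n; i++) {
    if (values[i] > max) {
      max = values[i];
    }
  }
  rec.column = column;
  rec.key = "max";
  rec.value.type = STAT_VALUE_INT64;
  rec.value.as.i64 = (int64_t)max; /* lossless widening */
  rec.is_approximate = 0;
  return rec;
}
```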



Thanks,
-- 
kou


In <20240522.113708.2023905028549001143@clear-code.com>
  "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
11:37:08 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> We're discussing how to provide statistics through the C
> data interface at:
> https://github.com/apache/arrow/issues/38837
> 
> If you're interested in this feature, could you share your
> comments?
> 
> 
> Motivation:
> 
> We can interchange Apache Arrow data by the C data interface
> in the same process. For example, we can pass Apache Arrow
> data read by Apache Arrow C++ (provider) to DuckDB
> (consumer) through the C data interface.
> 
> A provider may know Apache Arrow data statistics. For
> example, a provider can know statistics when it reads Apache
> Parquet data because Apache Parquet may provide statistics.
> 
> But a consumer can't know statistics that are known by a
> provider because there isn't a standard way to provide
> statistics through the C data interface. If a consumer can
> know statistics, it can process Apache Arrow data faster
> based on statistics.
> 
> 
> Proposal:
> 
> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> 
> How about providing statistics as metadata in ArrowSchema?
> 
> We reserve "ARROW" namespace for internal Apache Arrow use:
> 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
>> The ARROW pattern is a reserved namespace for internal
>> Arrow use in the custom_metadata fields. For example,
>> ARROW:extension:name.
> 
> So we can use "ARROW:statistics" for the metadata key.
> 
> We can represent statistics as an ArrowArray like ADBC does.
> 
> Here is an example ArrowSchema that is for a record batch
> that has "int32 column1" and "string column2":
> 
> ArrowSchema {
>   .format = "+s",
>   .metadata = {
>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>   },
>   .children = {
>     ArrowSchema {
>       .name = "column1",
>       .format = "i",
>       .metadata = {
>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>       },
>     },
>     ArrowSchema {
>       .name = "column2",
>       .format = "u",
>       .metadata = {
>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>       },
>     },
>   },
> }
> 
> The metadata value (A