Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
Hi,
> The exact types inside the dense_union would be chosen when encoding.
Ah, this approach doesn't standardize VALUE_SCHEMA (i.e., use a fixed VALUE_SCHEMA). If it works in the real world, it's more flexible.
> Version markers in two-sided protocols never work well long term:
> see Parquet files l

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
Hi,
>> | Name   | Type | Comments |
>> |--------|------|----------|
>> | column | utf8 | (2)      |
>
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by int

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
Hi,
We can use 4. for per-batch statistics because 4. uses a separate API call. Users can design that separate API call for per-batch statistics.
Thanks,
-- kou
In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 13:14:08 +0200, Alessandro Molina wrote:
> I bro

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
Hi,
>> Metadata:
>> | Name                       | Value | Comments |
>> |----------------------------|-------|----------|
>> | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
The Apache Arrow columnar format uses semantic versioning. So I th

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-08 Thread Felipe Oliveira Carvalho
+1. I think the benefits outweigh the risks.
On Wed, Jun 5, 2024 at 3:05 PM Anja wrote:
>
> I did want to start off by acknowledging that all of the pros you listed
> for mimalloc are accurate.
>
> I did want to contribute the times that people have been caught off-guard
> by the perceived increa

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Felipe Oliveira Carvalho
> I just used quantiles as an example of a statistic that's not in the current
> proposed spec, but that some engines would like to expose.
All statistics are optional so we can always add more to the spec.
> In other words, a plain integer makes extensibility more difficult than a
> string.
O

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou
On 07/06/2024 at 18:30, Felipe Oliveira Carvalho wrote:
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote:
On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote:
I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both pr