Generally I think this is a good idea. It has been proposed before, but I
don't think we were ever able to make progress on the design.
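
To give the design discussion something concrete to push against, here is a
minimal sketch of what the arrow::ArrayStatistics proposed below could look
like. Everything in it is an assumption for illustration: neither the struct
nor the CanSkipGreaterThan helper exists in Arrow today, the field names are
only loosely modeled on what parquet::Statistics exposes (null count, distinct
count, min, max), and min/max are fixed to int64 just to keep it short.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical type, not part of Arrow. Every field is optional because a
// source (for example a Parquet column chunk) may not have recorded it.
struct ArrayStatistics {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> min;  // int64 only, to keep the sketch small
  std::optional<int64_t> max;
};

// Hypothetical consumer: decide whether a filter `value > threshold` can skip
// the array entirely, using only the attached statistics and no source reader.
// Returns false when max is unknown, since nothing can be concluded then.
inline bool CanSkipGreaterThan(const ArrayStatistics& stats, int64_t threshold) {
  return stats.max.has_value() && *stats.max <= threshold;
}
```

A consumer holding only the array and its statistics could then call
CanSkipGreaterThan(stats, 10) before scanning any values. A real design would
still have to settle how to represent min/max for non-integer types and
whether the values are exact or only bounds.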

On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Related GitHub issue:
> https://github.com/apache/arrow/issues/41909
>
> How about adding arrow::ArrayStatistics?
>
> Motivation:
>
> Apache Arrow format data doesn't have statistics. (We can
> add statistics as metadata, but there isn't any standard way
> to do it.)
>
> But the source of Apache Arrow format data, such as Apache
> Parquet format data, may have statistics. We can get the
> source statistics via the source reader, for example with
> parquet::ColumnChunkMetaData::statistics(), but we can't get
> them from the Apache Arrow format data that was read. If we
> want to use the source statistics, we need to keep the
> source reader.
>
> Proposal:
>
> How about adding arrow::ArrayStatistics or something similar
> and attaching the source statistics to the arrow::Array that
> was read? If the source statistics are attached to the
> arrow::Array, we don't need to keep a source reader to get
> them.
>
> What do you think about this idea?
>
>
> NOTE: I haven't thought about the arrow::ArrayStatistics
> details yet. We'll be able to use parquet::Statistics and
> its family as a reference.
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>
>
> Thanks,
> --
> kou
>
