Generally I think this is a good idea. It has been proposed before, but I don't think we were ever able to make progress on the design.

On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Related GitHub issue:
> https://github.com/apache/arrow/issues/41909
>
> How about adding arrow::ArrayStatistics?
>
> Motivation:
>
> An Apache Arrow format data doesn't have statistics. (We can
> add statistics as metadata but there isn't any standard way
> for it.)
>
> But a source of an Apache Arrow format data such as Apache
> Parquet format data may have statistics. We can get the
> source statistics via source reader such as
> parquet::ColumnChunkMetaData::statistics() but can't get
> them from read Apache Arrow format data. If we want to use
> the source statistics, we need to keep the source reader.
>
> Proposal:
>
> How about adding arrow::ArrayStatistics or something and
> attaching source statistics to read arrow::Array? If source
> statistics are attached to read arrow::Array, we don't need
> to keep a source reader to get source statistics.
>
> What do you think about this idea?
>
>
> NOTE: I haven't thought about the arrow::ArrayStatistics
> details yet. We'll be able to use parquet::Statistics and
> its family as a reference.
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>
>
> Thanks,
> --
> kou
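
To make the proposal a bit more concrete for discussion, here is a rough C++ sketch of how source statistics might travel with an array. Everything in it is an assumption and not an existing Arrow API: the fields of `ArrayStatistics`, the stand-in `Int64Array` class (used instead of arrow::Array so the sketch compiles on its own), and the `ReadWithStatistics` helper are all hypothetical. The field choices loosely follow what Parquet statistics commonly record (min, max, null count, distinct count).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical: not an existing Arrow API. Every field is optional because a
// source (for example a Parquet column chunk) may record only some of them.
struct ArrayStatistics {
  using Value = std::variant<bool, std::int64_t, double>;
  std::optional<std::int64_t> null_count;
  std::optional<std::int64_t> distinct_count;
  std::optional<Value> min;
  std::optional<Value> max;
};

// Stand-in for arrow::Array, limited to int64 values to keep the sketch small.
class Int64Array {
 public:
  explicit Int64Array(std::vector<std::int64_t> values,
                      std::shared_ptr<const ArrayStatistics> statistics = nullptr)
      : values_(std::move(values)), statistics_(std::move(statistics)) {}

  const std::vector<std::int64_t>& values() const { return values_; }

  // Null when the source provided no statistics.
  const std::shared_ptr<const ArrayStatistics>& statistics() const {
    return statistics_;
  }

 private:
  std::vector<std::int64_t> values_;
  std::shared_ptr<const ArrayStatistics> statistics_;
};

// Hypothetical reader step: copy statistics the source already has onto the
// array, so the caller no longer needs to keep the source reader around.
Int64Array ReadWithStatistics(std::vector<std::int64_t> values,
                              std::int64_t source_min,
                              std::int64_t source_max,
                              std::int64_t source_null_count) {
  assert(source_min <= source_max);
  auto stats = std::make_shared<ArrayStatistics>();
  stats->null_count = source_null_count;
  stats->min = ArrayStatistics::Value(source_min);
  stats->max = ArrayStatistics::Value(source_max);
  // distinct_count stays unset: this imaginary source did not record it.
  return Int64Array(std::move(values), std::move(stats));
}
```

In this sketch a consumer would check `array.statistics()` for null first, then check each optional field before using it, since a source may supply only part of the statistics. Whether the statistics should live on the array itself or on its underlying data is exactly the kind of design question that would still need to be settled.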