Hi

This is C++ specific, but IMO the question applies more broadly.

My understanding is that the rationale for statistics in compressed and
encoded formats like Parquet is that computing them is expensive (I/O +
decompress + decode + aggregate). That cost motivates materializing the
aggregates.

In Arrow the data is already in an in-memory format (e.g. IPC + mmap, or on
the heap), so the cost is smaller (only the aggregate step).

It could be useful to quantify how much is being saved vs how much
complexity is being added to the format + implementations.
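
As a rough starting point for that measurement, the aggregate-only cost could
be timed with something like the sketch below. It deliberately does not use
any Arrow API: a std::vector<int64_t> stands in for a non-nullable in-memory
column, and ComputeMinMax and TimeMinMaxNanos are hypothetical helper names
made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): min and max of an in-memory column.
// Assumes the column is non-empty and has no nulls.
std::pair<int64_t, int64_t> ComputeMinMax(const std::vector<int64_t>& values) {
  auto result = std::minmax_element(values.begin(), values.end());
  return {*result.first, *result.second};
}

// Hypothetical helper: wall-clock nanoseconds spent on one min/max pass.
// The computed statistics are written to *out so the work is not optimized away.
int64_t TimeMinMaxNanos(const std::vector<int64_t>& values,
                        std::pair<int64_t, int64_t>* out) {
  auto start = std::chrono::steady_clock::now();
  *out = ComputeMinMax(values);
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}
```

Comparing that number against the time to read the same column from a Parquet
file (I/O + decompress + decode) would give a first estimate of the saving.
A single pass like this is noisy, so a real comparison would need repeated
runs and realistic column sizes.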

Best,
Jorge


On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> wrote:

> Generally I think this is a good idea that has been proposed before but I
> don't think we could ever make progress on design.
>
> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
>
> > Hi,
> >
> > Related GitHub issue:
> > https://github.com/apache/arrow/issues/41909
> >
> > How about adding arrow::ArrayStatistics?
> >
> > Motivation:
> >
> > Apache Arrow format data doesn't have statistics. (We can
> > add statistics as metadata, but there isn't any standard way
> > to do it.)
> >
> > But a source of Apache Arrow format data, such as Apache
> > Parquet format data, may have statistics. We can get the
> > source statistics via a source reader, such as
> > parquet::ColumnChunkMetaData::statistics(), but we can't get
> > them from the Apache Arrow format data that was read. If we
> > want to use the source statistics, we need to keep the source
> > reader.
> >
> > Proposal:
> >
> > How about adding arrow::ArrayStatistics or something similar
> > and attaching the source statistics to the arrow::Array that
> > was read? If the source statistics are attached to that
> > arrow::Array, we don't need to keep a source reader to get
> > them.
> >
> > What do you think about this idea?
> >
> >
> > NOTE: I haven't thought about the arrow::ArrayStatistics
> > details yet. We'll be able to use parquet::Statistics and
> > its family as a reference.
> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> >
> >
> > Thanks,
> > --
> > kou
> >
>
