Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Sutou Kouhei Thu, 13 Jun 2024 01:32:47 -0700

Hi,

Thanks for explaining the Rust implementation.


> The reason for using an array per item (rather than a RecordBatch) is so
> the statistics can be extracted only when needed -- as in if only mins are
> needed, there is no reason to also extract maxes and null counts. I don't
> know how important this is in practice

Thanks. I understand it.

Similar optimization(?) may be able to do with the
RecordBatch approach by specifying the target statistics
explicitly. For example, get_statistics(min) returns a
RecordBatch but it has only min statistics.

If multiple statistics are needed (e.g. min and max for
range search), the RecordBatch approach may be natural. But
the Array approach isn't bad because we can return multiple
Arrays.

So both approaches may not be so different.


Thanks,
-- 
kou

In <cafhtnrwsmp5g0ktrdp_rj5ph37hs-l4uc-ey9zohz-svp7d...@mail.gmail.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Tue, 11 Jun 
2024 05:58:48 -0400,
  Andrew Lamb <al...@influxdata.com> wrote:

>> Interesting. A separated but related discussion[10] will use
> a record batch or a map array for statistics and it includes
> multiple statistic items. But arrow-rs (DataFusion) uses an
> Arrow array per statistic item.
> 
> The reason for using an array per item (rather than a RecordBatch) is so
> the statistics can be extracted only when needed -- as in if only mins are
> needed, there is no reason to also extract maxes and null counts. I don't
> know how important this is in practice
> 
>> It seems that datafunsion::common::Statistics[2] (this is a
> higher level statistics object, right?) doesn't use Arrow
> arrays.
> 
> That is correct
> 
>> When extracted Parquet statistics are converted to
> datafunsion::common::Statistics from Arrow arrays in
> DataFusion? (It's WIP?)
> 
> It is currently done in [1], though I expect this code will improve over
> time
> 
> Andrew
> 
> [1]:
> https://github.com/apache/datafusion/blob/47026a2a3dd41a5c87e44ade58d91a89feba147b/datafusion/core/src/datasource/file_format/parquet.rs#L458-L536
> 
> 
> 
> On Tue, Jun 11, 2024 at 3:49 AM Sutou Kouhei <k...@clear-code.com> wrote:
> 
>> Hi,
>>
>> Thanks for sharing arrow-rs related information!
>>
>> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP
>> but
>> > I plan to propose upstreaming[4] to arrow-rs when complete)
>>
>> Interesting. A separated but related discussion[10] will use
>> a record batch or a map array for statistics and it includes
>> multiple statistic items. But arrow-rs (DataFusion) uses an
>> Arrow array per statistic item.
>>
>> It seems that datafunsion::common::Statistics[2] (this is a
>> higher level statistics object, right?) doesn't use Arrow
>> arrays. When extracted Parquet statistics are converted to
>> datafunsion::common::Statistics from Arrow arrays in
>> DataFusion? (It's WIP?)
>>
>>
>> [10] https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <CAFhtnRwgixMc_vFmogNZ7V=46mjg53md+dsftcw_c5qvyhs...@mail.gmail.com>
>>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Mon, 10
>> Jun 2024 11:26:23 -0400,
>>   Andrew Lamb <al...@influxdata.com> wrote:
>>
>> >> (Does arrow-rs compute statistics from in-memory Arrow array?)
>> >
>> > Not really, though there are kernels[1] to do so for some types
>> >
>> > We have two related concepts in the Rust ecosystem:
>> > 1. Full on statistics in DataFusion [2] (though no great way at the
>> moment)
>> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP
>> but
>> > I plan to propose upstreaming[4] to arrow-rs when complete)
>> >
>> > I think that  code to extract statistics from Parquet files as arrow
>> arrays
>> > is a very important feature (and lets query engines do row group and data
>> > page prunng).
>> >
>> > The value of a  higher level Statistics object is a little less clear to
>> me
>> > -- query engines end up with all sorts of complicated calculations on
>> such
>> > objects (like predicate selectivity, NDV estimation, boundary analysis,
>> > etc) that finding what level makes sense in arrow might be hard.
>> >
>> > Andrew
>> >
>> > [1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html
>> > [2]:
>> >
>> https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html
>> > [3]:
>> >
>> https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579
>> > [4]: https://github.com/apache/arrow-rs/issues/4328
>> >
>> > On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for your comment.
>> >>
>> >> You may misunderstand my motivation.
>> >>
>> >> This proposal doesn't change the Apache Arrow columnar
>> >> format. For example, this proposal doesn't save statistics
>> >> read from Apache Parquet file to Apache Arrow IPC file. This
>> >> proposal just attaches statistics read from Apache Parquet
>> >> file to in-memory arrow::Array C++ objects. It's just for
>> >> easy to use in-memory arrow::Array C++ objects.
>> >>
>> >> This proposal doesn't compute statistics from in-memory
>> >> arrow::Array C++ objects. (We may want to do it later but
>> >> this proposal doesn't propose it.)
>> >>
>> >> (Does arrow-rs compute statistics from in-memory Arrow
>> >> array?)
>> >>
>> >>
>> >> Thanks,
>> >> --
>> >> kou
>> >>
>> >> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
>> >>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu,
>> 6
>> >> Jun 2024 08:13:11 +0200,
>> >>   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>> >>
>> >> > Hi
>> >> >
>> >> > This is c++ specific, but imo the question applies more broadly.
>> >> >
>> >> > I understood that the rationale for stats in compressed+encoded
>> formats
>> >> > like parquet is that computing those stats has a high cost (io +
>> >> decompress
>> >> > + decode + aggregate). This motivates the materialization of
>> aggregates.
>> >> >
>> >> > In arrow the data is already in an in-memory format (e.g. IPC+mmap,
>> or in
>> >> > the heap) and the cost is thus smaller (aggregate).
>> >> >
>> >> > It could be useful to quantify how much is being saved vs how much
>> >> > complexity is being added to the format + implementations.
>> >> >
>> >> > Best,
>> >> > Jorge
>> >> >
>> >> >
>> >> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Generally I think this is a good idea that has been proposed before
>> but
>> >> I
>> >> >> don't think we could ever make progress on design.
>> >> >>
>> >> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com>
>> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > Related GitHub issue:
>> >> >> > https://github.com/apache/arrow/issues/41909
>> >> >> >
>> >> >> > How about adding arrow::ArrayStatistics?
>> >> >> >
>> >> >> > Motivation:
>> >> >> >
>> >> >> > An Apache Arrow format data doesn't have statistics. (We can
>> >> >> > add statistics as metadata but there isn't any standard way
>> >> >> > for it.)
>> >> >> >
>> >> >> > But a source of an Apache Arrow format data such as Apache
>> >> >> > Parquet format data may have statistics. We can get the
>> >> >> > source statistics via source reader such as
>> >> >> > parquet::ColumnChunkMetaData::statistics() but can't get
>> >> >> > them from read Apache Arrow format data. If we want to use
>> >> >> > the source statistics, we need to keep the source reader.
>> >> >> >
>> >> >> > Proposal:
>> >> >> >
>> >> >> > How about adding arrow::ArrayStatistics or something and
>> >> >> > attaching source statistics to read arrow::Array? If source
>> >> >> > statistics are attached to read arrow::Array, we don't need
>> >> >> > to keep a source reader to get source statistics.
>> >> >> >
>> >> >> > What do you think about this idea?
>> >> >> >
>> >> >> >
>> >> >> > NOTE: I haven't thought about the arrow::ArrayStatistics
>> >> >> > details yet. We'll be able to use parquet::Statistics and
>> >> >> > its family as a reference.
>> >> >> >
>> >> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>> >> >> >
>> >> >> >
>> >> >> > Thanks,
>> >> >> > --
>> >> >> > kou
>> >> >> >
>> >> >>
>> >>
>>

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Reply via email to