Hi, I've merge only the arrow::ArrayStatistics part: https://github.com/apache/arrow/pull/43273
I'll add Apache Parquet integration later: https://github.com/apache/arrow/issues/43549 Thanks, -- kou In <20240725.140825.2280958205506854888....@clear-code.com> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 25 Jul 2024 14:08:25 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > I've updated the arrow::ArrayStatistics implementation: > https://github.com/apache/arrow/pull/43273 > > It's a separated PR of the previous PR: > https://github.com/apache/arrow/pull/42133 > > The new PR doesn't include Apache Parquet integration. It > just adds arrow::ArrayStatistics. We'll add Apache Parquet > integration after we introduce arrow::ArrayStatistics. > > Changes: > > * Remove Apache Parquet integration > * Remove arrow::TypedArrayStatistics > * Rename arrow::ArrayStatistics::ElementBufferType to > arrow::ArrayStatistics::ValueType > * Rename arrow::ArrayStatistics::{min,max}_buffer to > arrow::ArrayStatistics::{min,max} > * Add std::string and std::string_view to > arrow::ArrayStatistics::ValueType for string/binary > values > > If there is no objection, I'll merge this in the next week. > > > Thanks, > -- > kou > > In <20240711.163002.1027113988921053382....@clear-code.com> > "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 11 > Jul 2024 16:30:02 +0900 (JST), > Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> I've updated the PoC of arrow::ArrayStatistics: >> https://github.com/apache/arrow/pull/42133 >> >> Changes: >> >> * Move arrow::ArrayStatistics from arrow::ArrayData to >> arrow::Array >> * Because arrow::ArrayData is mutable. If we change >> arrow::ArrayData, we should update or remove associated >> statistics too. >> See also the discussion: >> https://github.com/apache/arrow/pull/42133#discussion_r1640175130 >> * Add is_min_exact/is_max_exact >> * Because the latest Parquet has these information. >> (Parquet C++ doesn't support them yet.) >> See also the discussion: >> https://github.com/apache/arrow/pull/42133#discussion_r1638336729 >> >> >> What do you think about this approach? >> If there is no objection, I'll complete this approach. >> >> >> Thanks, >> -- >> kou >> >> In <20240613.174629.1705005054564708650....@clear-code.com> >> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 13 >> Jun 2024 17:46:29 +0900 (JST), >> Sutou Kouhei <k...@clear-code.com> wrote: >> >>> Hi, >>> >>> I implement the first version of arrow::ArrayStatistics: >>> https://github.com/apache/arrow/pull/42133 >>> >>> The PR description and this e-mail explain discussion >>> points. Could you share your comments? >>> >>> Implementation: >>> >>> arrow::ArrayStatistics: >>> >>> ---- >>> /// \brief Statistics for an Array >>> /// >>> /// Apache Arrow format doesn't have statistics but data source such >>> /// as Apache Parquet may have statistics. Statistics associate with >>> /// data source can be read unified API via this class. >>> struct ARROW_EXPORT ArrayStatistics { >>> public: >>> // DISCUSS: >>> // * How do we support binary/text data? >>> // * Should we use arrow::Scalar instead? >>> using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, >>> uint16_t, >>> int32_t, uint32_t, int64_t, >>> uint64_t>; >>> >>> ArrayStatistics() = default; >>> ~ArrayStatistics() = default; >>> >>> /// \brief The number of null values, may not be set >>> std::optional<int64_t> null_count = std::nullopt; >>> >>> /// \brief The number of distinct values, may not be set >>> std::optional<int64_t> distinct_count = std::nullopt; >>> >>> /// \brief The current minimum value buffer, may not be set >>> std::optional<ElementBufferType> min_buffer = std::nullopt; >>> >>> /// \brief The current maximum value buffer, may not be set >>> std::optional<ElementBufferType> max_buffer = std::nullopt; >>> >>> /// \brief Check two Statistics for equality >>> bool Equals(const ArrayStatistics& other) const { >>> return null_count == other.null_count && distinct_count == >>> other.distinct_count && >>> min_buffer == other.min_buffer && max_buffer == other.max_buffer; >>> } >>> }; >>> >>> /// \brief A typed implementation of ArrayStatistics for convenience >>> template <typename TypeClass> >>> class TypedArrayStatistics : public ArrayStatistics { >>> public: >>> using ElementType = typename TypeClass::c_type; >>> >>> /// \brief The current minimum value, may not be set >>> std::optional<ElementType> min() const { >>> if (min_buffer && std::holds_alternative<ElementType>(*min_buffer)) { >>> return std::get<ElementType>(*min_buffer); >>> } else { >>> return std::nullopt; >>> } >>> } >>> >>> /// \brief The current maximum value, may not be set >>> std::optional<ElementType> max() const { >>> if (max_buffer && std::holds_alternative<ElementType>(*max_buffer)) { >>> return std::get<ElementType>(*max_buffer); >>> } else { >>> return std::nullopt; >>> } >>> } >>> }; >>> ---- >>> >>> arrow::ArrayData has arrow::ArrayStatistics and >>> arrow::Array::statistics() returns it: >>> >>> * >>> https://github.com/apache/arrow/pull/42133/files#diff-c5d8f67626f30c66a79d1f677ee81264a9b044280e942cfa846e07306b25365bR400 >>> * >>> https://github.com/apache/arrow/pull/42133/files#diff-5557ff71f88c165e161f89fbb390e1c1fef70d067a892475aa77471b13e1e322R241 >>> >>> >>> Points: >>> >>> arrow::ArrayData not arrow::Array has arrow::ArrayStatistics >>> to keep backward compatibility and >>> extensibility. arrow::Array is immutable. So we need to set >>> all values in constructor. The current arrow::*Array >>> constructors have many arguments. Adding one more argument >>> for arrow::ArrayStatistics breaks ABI and reduces usability. >>> >>> If arrow::ArrayData has arrow::ArrayStatistics, we can >>> prepare arrow::ArrayStatistics out of constructor with >>> arrow::ArrayData. Because arrow::ArrayData is mutable. So we >>> can change arrow::ArrayData not in constructor. >>> >>> Limitations: >>> >>> This supports only the following types for now: >>> >>> * BooleanType >>> * Int*Type >>> * UInt*Type >>> >>> BinaryType and StringType family support will be difficult >>> than them because it requires variable size storage. >>> >>> Demo: >>> >>> This includes a demo how to associate statistics read from >>> Apache Parquet data to >>> arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array. >>> >>> https://github.com/apache/arrow/pull/42133/files#diff-1adfd998f8b1d8aef97c028e7f8d49fea94dbf42481d640be34e488d51581c0e >>> >>> Discussions: >>> >>> * How do we support min/max of binary/text data? >>> * Parquet has FLBAStatistics (FLBA = Fixed Length Byte Array). >>> * Can we use std::string_view for it? >>> * If so, where do we store the original value? >>> * Can we use std::string? >>> * It needs a copy and copying large data will not be desired. >>> >>> * Should we use arrow::Scalar not std::variant<bool, int8_t, ...>? >>> * If we use arrow::Scalar, we can support complex value >>> such as list and struct. But arrow::Scalar may be >>> heavy. >>> >>> * Should this be opt-in? >>> >>> >>> Thanks, >>> -- >>> kou >>> >>> In <20240609.161934.1193014034909936181....@clear-code.com> >>> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Sun, 09 >>> Jun 2024 16:19:34 +0900 (JST), >>> Sutou Kouhei <k...@clear-code.com> wrote: >>> >>>> Hi, >>>> >>>> OK. I'll propose arrow::ArrayStatistics API that can be used >>>> as a starting point. >>>> >>>> >>>> Thanks, >>>> -- >>>> kou >>>> >>>> In <cak7z5t8qsk0qnnwez4tbpj9x1p2oqy-t6karh-jh4zbjhe9...@mail.gmail.com> >>>> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Wed, 5 >>>> Jun 2024 22:55:25 -0700, >>>> Micah Kornfield <emkornfi...@gmail.com> wrote: >>>> >>>>> Generally I think this is a good idea that has been proposed before but I >>>>> don't think we could ever make progress on design. >>>>> >>>>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Related GitHub issue: >>>>>> https://github.com/apache/arrow/issues/41909 >>>>>> >>>>>> How about adding arrow::ArrayStatistics? >>>>>> >>>>>> Motivation: >>>>>> >>>>>> An Apache Arrow format data doesn't have statistics. (We can >>>>>> add statistics as metadata but there isn't any standard way >>>>>> for it.) >>>>>> >>>>>> But a source of an Apache Arrow format data such as Apache >>>>>> Parquet format data may have statistics. We can get the >>>>>> source statistics via source reader such as >>>>>> parquet::ColumnChunkMetaData::statistics() but can't get >>>>>> them from read Apache Arrow format data. If we want to use >>>>>> the source statistics, we need to keep the source reader. >>>>>> >>>>>> Proposal: >>>>>> >>>>>> How about adding arrow::ArrayStatistics or something and >>>>>> attaching source statistics to read arrow::Array? If source >>>>>> statistics are attached to read arrow::Array, we don't need >>>>>> to keep a source reader to get source statistics. >>>>>> >>>>>> What do you think about this idea? >>>>>> >>>>>> >>>>>> NOTE: I haven't thought about the arrow::ArrayStatistics >>>>>> details yet. We'll be able to use parquet::Statistics and >>>>>> its family as a reference. >>>>>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h >>>>>> >>>>>> >>>>>> Thanks, >>>>>> -- >>>>>> kou >>>>>>