Hi, I implement the first version of arrow::ArrayStatistics: https://github.com/apache/arrow/pull/42133
The PR description and this e-mail explain discussion points. Could you share your comments? Implementation: arrow::ArrayStatistics: ---- /// \brief Statistics for an Array /// /// Apache Arrow format doesn't have statistics but data source such /// as Apache Parquet may have statistics. Statistics associate with /// data source can be read unified API via this class. struct ARROW_EXPORT ArrayStatistics { public: // DISCUSS: // * How do we support binary/text data? // * Should we use arrow::Scalar instead? using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t>; ArrayStatistics() = default; ~ArrayStatistics() = default; /// \brief The number of null values, may not be set std::optional<int64_t> null_count = std::nullopt; /// \brief The number of distinct values, may not be set std::optional<int64_t> distinct_count = std::nullopt; /// \brief The current minimum value buffer, may not be set std::optional<ElementBufferType> min_buffer = std::nullopt; /// \brief The current maximum value buffer, may not be set std::optional<ElementBufferType> max_buffer = std::nullopt; /// \brief Check two Statistics for equality bool Equals(const ArrayStatistics& other) const { return null_count == other.null_count && distinct_count == other.distinct_count && min_buffer == other.min_buffer && max_buffer == other.max_buffer; } }; /// \brief A typed implementation of ArrayStatistics for convenience template <typename TypeClass> class TypedArrayStatistics : public ArrayStatistics { public: using ElementType = typename TypeClass::c_type; /// \brief The current minimum value, may not be set std::optional<ElementType> min() const { if (min_buffer && std::holds_alternative<ElementType>(*min_buffer)) { return std::get<ElementType>(*min_buffer); } else { return std::nullopt; } } /// \brief The current maximum value, may not be set std::optional<ElementType> max() const { if (max_buffer && std::holds_alternative<ElementType>(*max_buffer)) { return std::get<ElementType>(*max_buffer); } else { return std::nullopt; } } }; ---- arrow::ArrayData has arrow::ArrayStatistics and arrow::Array::statistics() returns it: * https://github.com/apache/arrow/pull/42133/files#diff-c5d8f67626f30c66a79d1f677ee81264a9b044280e942cfa846e07306b25365bR400 * https://github.com/apache/arrow/pull/42133/files#diff-5557ff71f88c165e161f89fbb390e1c1fef70d067a892475aa77471b13e1e322R241 Points: arrow::ArrayData not arrow::Array has arrow::ArrayStatistics to keep backward compatibility and extensibility. arrow::Array is immutable. So we need to set all values in constructor. The current arrow::*Array constructors have many arguments. Adding one more argument for arrow::ArrayStatistics breaks ABI and reduces usability. If arrow::ArrayData has arrow::ArrayStatistics, we can prepare arrow::ArrayStatistics out of constructor with arrow::ArrayData. Because arrow::ArrayData is mutable. So we can change arrow::ArrayData not in constructor. Limitations: This supports only the following types for now: * BooleanType * Int*Type * UInt*Type BinaryType and StringType family support will be difficult than them because it requires variable size storage. Demo: This includes a demo how to associate statistics read from Apache Parquet data to arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array. https://github.com/apache/arrow/pull/42133/files#diff-1adfd998f8b1d8aef97c028e7f8d49fea94dbf42481d640be34e488d51581c0e Discussions: * How do we support min/max of binary/text data? * Parquet has FLBAStatistics (FLBA = Fixed Length Byte Array). * Can we use std::string_view for it? * If so, where do we store the original value? * Can we use std::string? * It needs a copy and copying large data will not be desired. * Should we use arrow::Scalar not std::variant<bool, int8_t, ...>? * If we use arrow::Scalar, we can support complex value such as list and struct. But arrow::Scalar may be heavy. * Should this be opt-in? Thanks, -- kou In <20240609.161934.1193014034909936181....@clear-code.com> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Sun, 09 Jun 2024 16:19:34 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > OK. I'll propose arrow::ArrayStatistics API that can be used > as a starting point. > > > Thanks, > -- > kou > > In <cak7z5t8qsk0qnnwez4tbpj9x1p2oqy-t6karh-jh4zbjhe9...@mail.gmail.com> > "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Wed, 5 Jun > 2024 22:55:25 -0700, > Micah Kornfield <emkornfi...@gmail.com> wrote: > >> Generally I think this is a good idea that has been proposed before but I >> don't think we could ever make progress on design. >> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote: >> >>> Hi, >>> >>> Related GitHub issue: >>> https://github.com/apache/arrow/issues/41909 >>> >>> How about adding arrow::ArrayStatistics? >>> >>> Motivation: >>> >>> An Apache Arrow format data doesn't have statistics. (We can >>> add statistics as metadata but there isn't any standard way >>> for it.) >>> >>> But a source of an Apache Arrow format data such as Apache >>> Parquet format data may have statistics. We can get the >>> source statistics via source reader such as >>> parquet::ColumnChunkMetaData::statistics() but can't get >>> them from read Apache Arrow format data. If we want to use >>> the source statistics, we need to keep the source reader. >>> >>> Proposal: >>> >>> How about adding arrow::ArrayStatistics or something and >>> attaching source statistics to read arrow::Array? If source >>> statistics are attached to read arrow::Array, we don't need >>> to keep a source reader to get source statistics. >>> >>> What do you think about this idea? >>> >>> >>> NOTE: I haven't thought about the arrow::ArrayStatistics >>> details yet. We'll be able to use parquet::Statistics and >>> its family as a reference. >>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h >>> >>> >>> Thanks, >>> -- >>> kou >>>