Hi,
I've updated the PoC of arrow::ArrayStatistics:
https://github.com/apache/arrow/pull/42133
Changes:
* Move arrow::ArrayStatistics from arrow::ArrayData to
arrow::Array
* Because arrow::ArrayData is mutable. If we change
arrow::ArrayData, we should update or remove associated
statistics too.
See also the discussion:
https://github.com/apache/arrow/pull/42133#discussion_r1640175130
* Add is_min_exact/is_max_exact
* Because the latest Parquet has these information.
(Parquet C++ doesn't support them yet.)
See also the discussion:
https://github.com/apache/arrow/pull/42133#discussion_r1638336729
What do you think about this approach?
If there is no objection, I'll complete this approach.
Thanks,
--
kou
In <[email protected]>
"Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 13 Jun
2024 17:46:29 +0900 (JST),
Sutou Kouhei <[email protected]> wrote:
> Hi,
>
> I implement the first version of arrow::ArrayStatistics:
> https://github.com/apache/arrow/pull/42133
>
> The PR description and this e-mail explain discussion
> points. Could you share your comments?
>
> Implementation:
>
> arrow::ArrayStatistics:
>
> ----
> /// \brief Statistics for an Array
> ///
> /// Apache Arrow format doesn't have statistics but data source such
> /// as Apache Parquet may have statistics. Statistics associate with
> /// data source can be read unified API via this class.
> struct ARROW_EXPORT ArrayStatistics {
> public:
> // DISCUSS:
> // * How do we support binary/text data?
> // * Should we use arrow::Scalar instead?
> using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t,
> uint16_t,
> int32_t, uint32_t, int64_t,
> uint64_t>;
>
> ArrayStatistics() = default;
> ~ArrayStatistics() = default;
>
> /// \brief The number of null values, may not be set
> std::optional<int64_t> null_count = std::nullopt;
>
> /// \brief The number of distinct values, may not be set
> std::optional<int64_t> distinct_count = std::nullopt;
>
> /// \brief The current minimum value buffer, may not be set
> std::optional<ElementBufferType> min_buffer = std::nullopt;
>
> /// \brief The current maximum value buffer, may not be set
> std::optional<ElementBufferType> max_buffer = std::nullopt;
>
> /// \brief Check two Statistics for equality
> bool Equals(const ArrayStatistics& other) const {
> return null_count == other.null_count && distinct_count ==
> other.distinct_count &&
> min_buffer == other.min_buffer && max_buffer == other.max_buffer;
> }
> };
>
> /// \brief A typed implementation of ArrayStatistics for convenience
> template <typename TypeClass>
> class TypedArrayStatistics : public ArrayStatistics {
> public:
> using ElementType = typename TypeClass::c_type;
>
> /// \brief The current minimum value, may not be set
> std::optional<ElementType> min() const {
> if (min_buffer && std::holds_alternative<ElementType>(*min_buffer)) {
> return std::get<ElementType>(*min_buffer);
> } else {
> return std::nullopt;
> }
> }
>
> /// \brief The current maximum value, may not be set
> std::optional<ElementType> max() const {
> if (max_buffer && std::holds_alternative<ElementType>(*max_buffer)) {
> return std::get<ElementType>(*max_buffer);
> } else {
> return std::nullopt;
> }
> }
> };
> ----
>
> arrow::ArrayData has arrow::ArrayStatistics and
> arrow::Array::statistics() returns it:
>
> *
> https://github.com/apache/arrow/pull/42133/files#diff-c5d8f67626f30c66a79d1f677ee81264a9b044280e942cfa846e07306b25365bR400
> *
> https://github.com/apache/arrow/pull/42133/files#diff-5557ff71f88c165e161f89fbb390e1c1fef70d067a892475aa77471b13e1e322R241
>
>
> Points:
>
> arrow::ArrayData not arrow::Array has arrow::ArrayStatistics
> to keep backward compatibility and
> extensibility. arrow::Array is immutable. So we need to set
> all values in constructor. The current arrow::*Array
> constructors have many arguments. Adding one more argument
> for arrow::ArrayStatistics breaks ABI and reduces usability.
>
> If arrow::ArrayData has arrow::ArrayStatistics, we can
> prepare arrow::ArrayStatistics out of constructor with
> arrow::ArrayData. Because arrow::ArrayData is mutable. So we
> can change arrow::ArrayData not in constructor.
>
> Limitations:
>
> This supports only the following types for now:
>
> * BooleanType
> * Int*Type
> * UInt*Type
>
> BinaryType and StringType family support will be difficult
> than them because it requires variable size storage.
>
> Demo:
>
> This includes a demo how to associate statistics read from
> Apache Parquet data to
> arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array.
>
> https://github.com/apache/arrow/pull/42133/files#diff-1adfd998f8b1d8aef97c028e7f8d49fea94dbf42481d640be34e488d51581c0e
>
> Discussions:
>
> * How do we support min/max of binary/text data?
> * Parquet has FLBAStatistics (FLBA = Fixed Length Byte Array).
> * Can we use std::string_view for it?
> * If so, where do we store the original value?
> * Can we use std::string?
> * It needs a copy and copying large data will not be desired.
>
> * Should we use arrow::Scalar not std::variant<bool, int8_t, ...>?
> * If we use arrow::Scalar, we can support complex value
> such as list and struct. But arrow::Scalar may be
> heavy.
>
> * Should this be opt-in?
>
>
> Thanks,
> --
> kou
>
> In <[email protected]>
> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Sun, 09
> Jun 2024 16:19:34 +0900 (JST),
> Sutou Kouhei <[email protected]> wrote:
>
>> Hi,
>>
>> OK. I'll propose arrow::ArrayStatistics API that can be used
>> as a starting point.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <cak7z5t8qsk0qnnwez4tbpj9x1p2oqy-t6karh-jh4zbjhe9...@mail.gmail.com>
>> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Wed, 5
>> Jun 2024 22:55:25 -0700,
>> Micah Kornfield <[email protected]> wrote:
>>
>>> Generally I think this is a good idea that has been proposed before but I
>>> don't think we could ever make progress on design.
>>>
>>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Related GitHub issue:
>>>> https://github.com/apache/arrow/issues/41909
>>>>
>>>> How about adding arrow::ArrayStatistics?
>>>>
>>>> Motivation:
>>>>
>>>> An Apache Arrow format data doesn't have statistics. (We can
>>>> add statistics as metadata but there isn't any standard way
>>>> for it.)
>>>>
>>>> But a source of an Apache Arrow format data such as Apache
>>>> Parquet format data may have statistics. We can get the
>>>> source statistics via source reader such as
>>>> parquet::ColumnChunkMetaData::statistics() but can't get
>>>> them from read Apache Arrow format data. If we want to use
>>>> the source statistics, we need to keep the source reader.
>>>>
>>>> Proposal:
>>>>
>>>> How about adding arrow::ArrayStatistics or something and
>>>> attaching source statistics to read arrow::Array? If source
>>>> statistics are attached to read arrow::Array, we don't need
>>>> to keep a source reader to get source statistics.
>>>>
>>>> What do you think about this idea?
>>>>
>>>>
>>>> NOTE: I haven't thought about the arrow::ArrayStatistics
>>>> details yet. We'll be able to use parquet::Statistics and
>>>> its family as a reference.
>>>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>>>>
>>>>
>>>> Thanks,
>>>> --
>>>> kou
>>>>