Hi,

I've merge only the arrow::ArrayStatistics part:
https://github.com/apache/arrow/pull/43273

I'll add Apache Parquet integration later:
https://github.com/apache/arrow/issues/43549


Thanks,
-- 
kou

In <20240725.140825.2280958205506854888....@clear-code.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 25 Jul 
2024 14:08:25 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
> I've updated the arrow::ArrayStatistics implementation:
> https://github.com/apache/arrow/pull/43273
> 
> It's a separated PR of the previous PR:
> https://github.com/apache/arrow/pull/42133
> 
> The new PR doesn't include Apache Parquet integration. It
> just adds arrow::ArrayStatistics. We'll add Apache Parquet
> integration after we introduce arrow::ArrayStatistics.
> 
> Changes:
> 
> * Remove Apache Parquet integration
> * Remove arrow::TypedArrayStatistics
> * Rename arrow::ArrayStatistics::ElementBufferType to
>   arrow::ArrayStatistics::ValueType
> * Rename arrow::ArrayStatistics::{min,max}_buffer to
>   arrow::ArrayStatistics::{min,max}
> * Add std::string and std::string_view to
>   arrow::ArrayStatistics::ValueType for string/binary
>   values
> 
> If there is no objection, I'll merge this in the next week.
> 
> 
> Thanks,
> -- 
> kou
> 
> In <20240711.163002.1027113988921053382....@clear-code.com>
>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 11 
> Jul 2024 16:30:02 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
> 
>> Hi,
>> 
>> I've updated the PoC of arrow::ArrayStatistics:
>> https://github.com/apache/arrow/pull/42133
>> 
>> Changes:
>> 
>> * Move arrow::ArrayStatistics from arrow::ArrayData to
>>   arrow::Array
>>   * Because arrow::ArrayData is mutable. If we change
>>     arrow::ArrayData, we should update or remove associated
>>     statistics too.
>>     See also the discussion:
>>     https://github.com/apache/arrow/pull/42133#discussion_r1640175130
>> * Add is_min_exact/is_max_exact
>>   * Because the latest Parquet has these information.
>>     (Parquet C++ doesn't support them yet.)
>>     See also the discussion:
>>     https://github.com/apache/arrow/pull/42133#discussion_r1638336729
>> 
>> 
>> What do you think about this approach?
>> If there is no objection, I'll complete this approach.
>> 
>> 
>> Thanks,
>> -- 
>> kou
>> 
>> In <20240613.174629.1705005054564708650....@clear-code.com>
>>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 13 
>> Jun 2024 17:46:29 +0900 (JST),
>>   Sutou Kouhei <k...@clear-code.com> wrote:
>> 
>>> Hi,
>>> 
>>> I implement the first version of arrow::ArrayStatistics:
>>> https://github.com/apache/arrow/pull/42133
>>> 
>>> The PR description and this e-mail explain discussion
>>> points. Could you share your comments?
>>> 
>>> Implementation:
>>> 
>>> arrow::ArrayStatistics:
>>> 
>>> ----
>>> /// \brief Statistics for an Array
>>> ///
>>> /// Apache Arrow format doesn't have statistics but data source such
>>> /// as Apache Parquet may have statistics. Statistics associate with
>>> /// data source can be read unified API via this class.
>>> struct ARROW_EXPORT ArrayStatistics {
>>>  public:
>>>   // DISCUSS:
>>>   //   * How do we support binary/text data?
>>>   //   * Should we use arrow::Scalar instead?
>>>   using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, 
>>> uint16_t,
>>>                                          int32_t, uint32_t, int64_t, 
>>> uint64_t>;
>>> 
>>>   ArrayStatistics() = default;
>>>   ~ArrayStatistics() = default;
>>> 
>>>   /// \brief The number of null values, may not be set
>>>   std::optional<int64_t> null_count = std::nullopt;
>>> 
>>>   /// \brief The number of distinct values, may not be set
>>>   std::optional<int64_t> distinct_count = std::nullopt;
>>> 
>>>   /// \brief The current minimum value buffer, may not be set
>>>   std::optional<ElementBufferType> min_buffer = std::nullopt;
>>> 
>>>   /// \brief The current maximum value buffer, may not be set
>>>   std::optional<ElementBufferType> max_buffer = std::nullopt;
>>> 
>>>   /// \brief Check two Statistics for equality
>>>   bool Equals(const ArrayStatistics& other) const {
>>>     return null_count == other.null_count && distinct_count == 
>>> other.distinct_count &&
>>>            min_buffer == other.min_buffer && max_buffer == other.max_buffer;
>>>   }
>>> };
>>> 
>>> /// \brief A typed implementation of ArrayStatistics for convenience
>>> template <typename TypeClass>
>>> class TypedArrayStatistics : public ArrayStatistics {
>>>  public:
>>>   using ElementType = typename TypeClass::c_type;
>>> 
>>>   /// \brief The current minimum value, may not be set
>>>   std::optional<ElementType> min() const {
>>>     if (min_buffer && std::holds_alternative<ElementType>(*min_buffer)) {
>>>       return std::get<ElementType>(*min_buffer);
>>>     } else {
>>>       return std::nullopt;
>>>     }
>>>   }
>>> 
>>>   /// \brief The current maximum value, may not be set
>>>   std::optional<ElementType> max() const {
>>>     if (max_buffer && std::holds_alternative<ElementType>(*max_buffer)) {
>>>       return std::get<ElementType>(*max_buffer);
>>>     } else {
>>>       return std::nullopt;
>>>     }
>>>   }
>>> };
>>> ----
>>> 
>>> arrow::ArrayData has arrow::ArrayStatistics and
>>> arrow::Array::statistics() returns it:
>>> 
>>> * 
>>> https://github.com/apache/arrow/pull/42133/files#diff-c5d8f67626f30c66a79d1f677ee81264a9b044280e942cfa846e07306b25365bR400
>>> * 
>>> https://github.com/apache/arrow/pull/42133/files#diff-5557ff71f88c165e161f89fbb390e1c1fef70d067a892475aa77471b13e1e322R241
>>> 
>>> 
>>> Points:
>>> 
>>> arrow::ArrayData not arrow::Array has arrow::ArrayStatistics
>>> to keep backward compatibility and
>>> extensibility. arrow::Array is immutable. So we need to set
>>> all values in constructor. The current arrow::*Array
>>> constructors have many arguments. Adding one more argument
>>> for arrow::ArrayStatistics breaks ABI and reduces usability.
>>> 
>>> If arrow::ArrayData has arrow::ArrayStatistics, we can
>>> prepare arrow::ArrayStatistics out of constructor with
>>> arrow::ArrayData. Because arrow::ArrayData is mutable. So we
>>> can change arrow::ArrayData not in constructor.
>>> 
>>> Limitations:
>>> 
>>> This supports only the following types for now:
>>> 
>>> * BooleanType
>>> * Int*Type
>>> * UInt*Type
>>> 
>>> BinaryType and StringType family support will be difficult
>>> than them because it requires variable size storage.
>>> 
>>> Demo:
>>> 
>>> This includes a demo how to associate statistics read from
>>> Apache Parquet data to
>>> arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array.
>>> 
>>> https://github.com/apache/arrow/pull/42133/files#diff-1adfd998f8b1d8aef97c028e7f8d49fea94dbf42481d640be34e488d51581c0e
>>> 
>>> Discussions:
>>> 
>>> * How do we support min/max of binary/text data?
>>>   * Parquet has FLBAStatistics (FLBA = Fixed Length Byte Array).
>>>   * Can we use std::string_view for it?
>>>     * If so, where do we store the original value?
>>>   * Can we use std::string?
>>>     * It needs a copy and copying large data will not be desired.
>>> 
>>> * Should we use arrow::Scalar not std::variant<bool, int8_t, ...>?
>>>   * If we use arrow::Scalar, we can support complex value
>>>     such as list and struct. But arrow::Scalar may be
>>>     heavy.
>>> 
>>> * Should this be opt-in?
>>> 
>>> 
>>> Thanks,
>>> -- 
>>> kou
>>> 
>>> In <20240609.161934.1193014034909936181....@clear-code.com>
>>>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Sun, 09 
>>> Jun 2024 16:19:34 +0900 (JST),
>>>   Sutou Kouhei <k...@clear-code.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> OK. I'll propose arrow::ArrayStatistics API that can be used
>>>> as a starting point.
>>>> 
>>>> 
>>>> Thanks,
>>>> -- 
>>>> kou
>>>> 
>>>> In <cak7z5t8qsk0qnnwez4tbpj9x1p2oqy-t6karh-jh4zbjhe9...@mail.gmail.com>
>>>>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Wed, 5 
>>>> Jun 2024 22:55:25 -0700,
>>>>   Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>> 
>>>>> Generally I think this is a good idea that has been proposed before but I
>>>>> don't think we could ever make progress on design.
>>>>> 
>>>>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>>
>>>>>> Related GitHub issue:
>>>>>> https://github.com/apache/arrow/issues/41909
>>>>>>
>>>>>> How about adding arrow::ArrayStatistics?
>>>>>>
>>>>>> Motivation:
>>>>>>
>>>>>> An Apache Arrow format data doesn't have statistics. (We can
>>>>>> add statistics as metadata but there isn't any standard way
>>>>>> for it.)
>>>>>>
>>>>>> But a source of an Apache Arrow format data such as Apache
>>>>>> Parquet format data may have statistics. We can get the
>>>>>> source statistics via source reader such as
>>>>>> parquet::ColumnChunkMetaData::statistics() but can't get
>>>>>> them from read Apache Arrow format data. If we want to use
>>>>>> the source statistics, we need to keep the source reader.
>>>>>>
>>>>>> Proposal:
>>>>>>
>>>>>> How about adding arrow::ArrayStatistics or something and
>>>>>> attaching source statistics to read arrow::Array? If source
>>>>>> statistics are attached to read arrow::Array, we don't need
>>>>>> to keep a source reader to get source statistics.
>>>>>>
>>>>>> What do you think about this idea?
>>>>>>
>>>>>>
>>>>>> NOTE: I haven't thought about the arrow::ArrayStatistics
>>>>>> details yet. We'll be able to use parquet::Statistics and
>>>>>> its family as a reference.
>>>>>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> kou
>>>>>>

Reply via email to