Hi,

I implement the first version of arrow::ArrayStatistics:
https://github.com/apache/arrow/pull/42133

The PR description and this e-mail explain discussion
points. Could you share your comments?

Implementation:

arrow::ArrayStatistics:

----
/// \brief Statistics for an Array
///
/// Apache Arrow format doesn't have statistics but data source such
/// as Apache Parquet may have statistics. Statistics associate with
/// data source can be read unified API via this class.
struct ARROW_EXPORT ArrayStatistics {
 public:
  // DISCUSS:
  //   * How do we support binary/text data?
  //   * Should we use arrow::Scalar instead?
  using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, 
uint16_t,
                                         int32_t, uint32_t, int64_t, uint64_t>;

  ArrayStatistics() = default;
  ~ArrayStatistics() = default;

  /// \brief The number of null values, may not be set
  std::optional<int64_t> null_count = std::nullopt;

  /// \brief The number of distinct values, may not be set
  std::optional<int64_t> distinct_count = std::nullopt;

  /// \brief The current minimum value buffer, may not be set
  std::optional<ElementBufferType> min_buffer = std::nullopt;

  /// \brief The current maximum value buffer, may not be set
  std::optional<ElementBufferType> max_buffer = std::nullopt;

  /// \brief Check two Statistics for equality
  bool Equals(const ArrayStatistics& other) const {
    return null_count == other.null_count && distinct_count == 
other.distinct_count &&
           min_buffer == other.min_buffer && max_buffer == other.max_buffer;
  }
};

/// \brief A typed implementation of ArrayStatistics for convenience
template <typename TypeClass>
class TypedArrayStatistics : public ArrayStatistics {
 public:
  using ElementType = typename TypeClass::c_type;

  /// \brief The current minimum value, may not be set
  std::optional<ElementType> min() const {
    if (min_buffer && std::holds_alternative<ElementType>(*min_buffer)) {
      return std::get<ElementType>(*min_buffer);
    } else {
      return std::nullopt;
    }
  }

  /// \brief The current maximum value, may not be set
  std::optional<ElementType> max() const {
    if (max_buffer && std::holds_alternative<ElementType>(*max_buffer)) {
      return std::get<ElementType>(*max_buffer);
    } else {
      return std::nullopt;
    }
  }
};
----

arrow::ArrayData has arrow::ArrayStatistics and
arrow::Array::statistics() returns it:

* 
https://github.com/apache/arrow/pull/42133/files#diff-c5d8f67626f30c66a79d1f677ee81264a9b044280e942cfa846e07306b25365bR400
* 
https://github.com/apache/arrow/pull/42133/files#diff-5557ff71f88c165e161f89fbb390e1c1fef70d067a892475aa77471b13e1e322R241


Points:

arrow::ArrayData not arrow::Array has arrow::ArrayStatistics
to keep backward compatibility and
extensibility. arrow::Array is immutable. So we need to set
all values in constructor. The current arrow::*Array
constructors have many arguments. Adding one more argument
for arrow::ArrayStatistics breaks ABI and reduces usability.

If arrow::ArrayData has arrow::ArrayStatistics, we can
prepare arrow::ArrayStatistics out of constructor with
arrow::ArrayData. Because arrow::ArrayData is mutable. So we
can change arrow::ArrayData not in constructor.

Limitations:

This supports only the following types for now:

* BooleanType
* Int*Type
* UInt*Type

BinaryType and StringType family support will be difficult
than them because it requires variable size storage.

Demo:

This includes a demo how to associate statistics read from
Apache Parquet data to
arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array.

https://github.com/apache/arrow/pull/42133/files#diff-1adfd998f8b1d8aef97c028e7f8d49fea94dbf42481d640be34e488d51581c0e

Discussions:

* How do we support min/max of binary/text data?
  * Parquet has FLBAStatistics (FLBA = Fixed Length Byte Array).
  * Can we use std::string_view for it?
    * If so, where do we store the original value?
  * Can we use std::string?
    * It needs a copy and copying large data will not be desired.

* Should we use arrow::Scalar not std::variant<bool, int8_t, ...>?
  * If we use arrow::Scalar, we can support complex value
    such as list and struct. But arrow::Scalar may be
    heavy.

* Should this be opt-in?


Thanks,
-- 
kou

In <20240609.161934.1193014034909936181....@clear-code.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Sun, 09 Jun 
2024 16:19:34 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
> OK. I'll propose arrow::ArrayStatistics API that can be used
> as a starting point.
> 
> 
> Thanks,
> -- 
> kou
> 
> In <cak7z5t8qsk0qnnwez4tbpj9x1p2oqy-t6karh-jh4zbjhe9...@mail.gmail.com>
>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Wed, 5 Jun 
> 2024 22:55:25 -0700,
>   Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
>> Generally I think this is a good idea that has been proposed before but I
>> don't think we could ever make progress on design.
>> 
>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
>> 
>>> Hi,
>>>
>>> Related GitHub issue:
>>> https://github.com/apache/arrow/issues/41909
>>>
>>> How about adding arrow::ArrayStatistics?
>>>
>>> Motivation:
>>>
>>> An Apache Arrow format data doesn't have statistics. (We can
>>> add statistics as metadata but there isn't any standard way
>>> for it.)
>>>
>>> But a source of an Apache Arrow format data such as Apache
>>> Parquet format data may have statistics. We can get the
>>> source statistics via source reader such as
>>> parquet::ColumnChunkMetaData::statistics() but can't get
>>> them from read Apache Arrow format data. If we want to use
>>> the source statistics, we need to keep the source reader.
>>>
>>> Proposal:
>>>
>>> How about adding arrow::ArrayStatistics or something and
>>> attaching source statistics to read arrow::Array? If source
>>> statistics are attached to read arrow::Array, we don't need
>>> to keep a source reader to get source statistics.
>>>
>>> What do you think about this idea?
>>>
>>>
>>> NOTE: I haven't thought about the arrow::ArrayStatistics
>>> details yet. We'll be able to use parquet::Statistics and
>>> its family as a reference.
>>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>

Reply via email to