AntoinePrv commented on code in PR #48549:
URL: https://github.com/apache/arrow/pull/48549#discussion_r2630255991
##########
cpp/src/arrow/util/rle_encoding_internal.h:
##########
@@ -308,6 +308,23 @@ class RleRunDecoder {
     return to_read;
   }
+  /// Get a batch of values and count how many equal match_value
+  [[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
+                                             rle_size_t value_bit_width,
+                                             value_type match_value, int64_t* out_count) {
+    if (ARROW_PREDICT_FALSE(remaining_count_ == 0)) {
+      return 0;
+    }
+
+    const auto to_read = std::min(remaining_count_, batch_size);
+    std::fill(out, out + to_read, value_);
+    if (value_ == match_value) {
+      *out_count += to_read;
+    }
+    remaining_count_ -= to_read;
+    return to_read;
+  }
+
Review Comment:
Could this call `RleRunDecoder::GetBatch` to avoid duplicating the logic?
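
A minimal sketch of what I have in mind, assuming `GetBatch` here takes the same `(out, batch_size, value_bit_width)` arguments as the new method (only the counting differs, so the fill itself can be delegated):

```cpp
// Sketch only, not a drop-in patch: delegate the fill to GetBatch and keep
// just the counting here. The run value is captured before the call in case
// GetBatch consumes the run state.
[[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
                                           rle_size_t value_bit_width,
                                           value_type match_value, int64_t* out_count) {
  const value_type run_value = value_;
  const auto to_read = GetBatch(out, batch_size, value_bit_width);
  if (run_value == match_value) {
    *out_count += to_read;
  }
  return to_read;
}
```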
##########
cpp/src/arrow/util/rle_encoding_internal.h:
##########
@@ -377,6 +394,19 @@ class BitPackedRunDecoder {
     return steps;
   }
+  /// Get a batch of values and count how many equal match_value
+  /// Note: For bit-packed runs, we use std::count after GetBatch since it's
+  /// highly optimized by the compiler. The fused approach is only beneficial
+  /// for RLE runs where counting is O(1).
+  [[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
+                                             rle_size_t value_bit_width,
+                                             value_type match_value, int64_t* out_count) {
+    const auto steps = GetBatch(out, batch_size, value_bit_width);
+    // std::count is highly optimized (SIMD) by modern compilers
+    *out_count += std::count(out, out + steps, match_value);
Review Comment:
I have been working on the `unpack` function used in `GetBatch`, and my
intuition is also that it could not easily be extended to count at the same
time as it extracts (not impossible, but it would require heavy changes).
Still, a fused approach could provide better data locality when processing
run by run.
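
To illustrate the data-locality point with a toy, self-contained example (this is not Arrow code; `DecodeChunk` is a hypothetical stand-in for the per-run unpack): counting each chunk right after it is decoded reads values that are still in cache, whereas decoding the whole column first and counting at the end streams the buffer through cache twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for unpacking one run of bit-packed values into `out`.
void DecodeChunk(const uint32_t* src, std::size_t n, uint32_t* out) {
  std::copy(src, src + n, out);
}

// Fused, run-by-run counting: the chunk just written by DecodeChunk is still
// warm in cache when std::count reads it back.
int64_t CountRunByRun(const std::vector<uint32_t>& encoded, uint32_t match) {
  constexpr std::size_t kChunk = 1024;
  std::vector<uint32_t> buffer(kChunk);
  int64_t count = 0;
  for (std::size_t pos = 0; pos < encoded.size(); pos += kChunk) {
    const std::size_t n = std::min(kChunk, encoded.size() - pos);
    DecodeChunk(encoded.data() + pos, n, buffer.data());
    count += std::count(buffer.data(), buffer.data() + n, match);
  }
  return count;
}
```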