pitrou commented on code in PR #46229: URL: https://github.com/apache/arrow/pull/46229#discussion_r2207484676
########## cpp/src/arrow/array/array_binary.cc: ########## @@ -105,6 +111,392 @@ BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType> type, int64_t length, ArrayData::Make(std::move(type), length, std::move(buffers), null_count, offset)); } +namespace { + +// TODO Should We move this to bitmap_ops.h and Remove from compute/kernels/util.s +Result<std::shared_ptr<Buffer>> GetOrCopyNullBitmapBuffer(const ArrayData& in_array, + MemoryPool* pool) { + if (in_array.buffers[0]->data() == nullptr) { + return nullptr; + } else if (in_array.offset == 0) { + return in_array.buffers[0]; + } else if (in_array.offset % 8 == 0) { + return SliceBuffer(in_array.buffers[0], /*offset=*/in_array.offset / 8); + } else { + // If a non-zero offset, we need to shift the bitmap + return internal::CopyBitmap(pool, in_array.buffers[0]->data(), in_array.offset, + in_array.length); + } +} + +struct Interval { + int64_t start; + int64_t end; + int32_t offset = -1; +}; + +struct IntervalComparator { + bool operator()(const Interval& left, const Interval& right) const { + return left.start < right.start; + } +}; + +// inspired from boost::icl::interval_set +class IntervalMerger { + public: + using IntervalSet = std::set<Interval, IntervalComparator>; + using Iterator = std::set<Interval, IntervalComparator>::iterator; + + void AddInterval(const Interval& interval) { + auto [it, is_inserted] = interval_set.insert(interval); Review Comment: While this approach is algorithmically correct, we should also be mindful of real-world data characteristics. In typical BinaryViewArrays, the views will be 1) contiguous and/or 2) ordered. Actually, many BinaryViewArrays will be already compact (except perhaps at the start/end) and we want to be able to handle this case quickly. So having a `O(n log n)` compaction algorithm that allocates many temporary nodes is wasteful for these typical cases. What I would suggest the `AddInterval` method does is: 1) check whether the interval comes after the last set element; if so, try to merge them and, if not possible, insert interval at the end (this is `O(1)` and avoids an allocation if the intervals can be merged) 2) otherwise, use the `lower_bound` method to find the interval just before; try to merge them and, if not possible, insert interval just after (this is `O(1)` but still avoids an allocation if the intervals can be merged) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org