pitrou commented on code in PR #46229:
URL: https://github.com/apache/arrow/pull/46229#discussion_r2207484676


##########
cpp/src/arrow/array/array_binary.cc:
##########
@@ -105,6 +111,392 @@ 
BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType> type, int64_t length,
       ArrayData::Make(std::move(type), length, std::move(buffers), null_count, 
offset));
 }
 
+namespace {
+
+// TODO Should We move this to bitmap_ops.h and Remove from 
compute/kernels/util.s
+Result<std::shared_ptr<Buffer>> GetOrCopyNullBitmapBuffer(const ArrayData& 
in_array,
+                                                          MemoryPool* pool) {
+  if (in_array.buffers[0]->data() == nullptr) {
+    return nullptr;
+  } else if (in_array.offset == 0) {
+    return in_array.buffers[0];
+  } else if (in_array.offset % 8 == 0) {
+    return SliceBuffer(in_array.buffers[0], /*offset=*/in_array.offset / 8);
+  } else {
+    // If a non-zero offset, we need to shift the bitmap
+    return internal::CopyBitmap(pool, in_array.buffers[0]->data(), 
in_array.offset,
+                                in_array.length);
+  }
+}
+
+struct Interval {
+  int64_t start;
+  int64_t end;
+  int32_t offset = -1;
+};
+
+struct IntervalComparator {
+  bool operator()(const Interval& left, const Interval& right) const {
+    return left.start < right.start;
+  }
+};
+
+// inspired from boost::icl::interval_set
+class IntervalMerger {
+ public:
+  using IntervalSet = std::set<Interval, IntervalComparator>;
+  using Iterator = std::set<Interval, IntervalComparator>::iterator;
+
+  void AddInterval(const Interval& interval) {
+    auto [it, is_inserted] = interval_set.insert(interval);

Review Comment:
   While this approach is algorithmically correct, we should also be mindful of 
real-world data characteristics.
   
   In typical BinaryViewArrays, the views will be 1) contiguous and/or 2) 
ordered. Actually, many BinaryViewArrays will be already compact (except 
perhaps at the start/end) and we want to be able to handle this case quickly.
   
   So having a `O(n log n)` compaction algorithm that allocates many temporary 
nodes is wasteful for these typical cases. What I would suggest the 
`AddInterval` method does is:
   1) check whether the interval comes after the last set element; if so, try 
to merge them and, if not possible, insert interval at the end (this is `O(1)` 
and avoids an allocation if the intervals can be merged)
   2) otherwise, use the `lower_bound` method to find the interval just before; 
try to merge them and, if not possible, insert interval just after (this is 
`O(1)` but still avoids an allocation if the intervals can be merged)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to