andishgar commented on code in PR #46229:
URL: https://github.com/apache/arrow/pull/46229#discussion_r2176905950


##########
cpp/src/arrow/array/array_binary.cc:
##########
@@ -104,7 +105,14 @@ BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType> 
type, int64_t length,
   SetData(
       ArrayData::Make(std::move(type), length, std::move(buffers), null_count, 
offset));
 }
-
+Result<std::shared_ptr<Array>> BinaryViewArray::CompactArray(MemoryPool* pool) 
const {
+  std::unique_ptr<ArrayBuilder> builder;
+  ARROW_RETURN_NOT_OK(MakeBuilder(pool, type(), &builder));
+  ArraySpan span(*data());
+  // ArraySpan handles offset
+  ARROW_RETURN_NOT_OK(builder->AppendArraySlice(span, 0, span.length));

Review Comment:
   I'm currently reviewing the implementation, and I think we should consider 
making some changes:
   
   1. **Bitmap Buffer**:
      It's possible to avoid copying the bitmap buffer entirely.
   
   2. **View Buffer**:
      For the view buffer, we likely need to make a copy, since it may already 
be shared due to operations like `cast`, `arrow::Array::View`, or 
`arrow::ViewOrCopyTo`.
      However, I wonder if we could use `std::shared_ptr::use_count()` to 
detect whether the array is the sole owner of the view buffer and skip the copy 
in that case.
   
   3. **Data Buffer**:
      Three scenarios come to mind:
   
      3.1. The buffer is entirely empty — in such cases, we can safely discard 
it. This might occur after calling a compute function like `extract_regex` 
(note: this is a hypothetical example, as compute functions don’t support 
`StringView` yet).
   
      3.2. Only a small portion of the data buffer is used — for instance, if 
only one-tenth of the buffer is referenced by the view. In this case, copying 
the used portion to a new buffer helps prevent memory bloat.
   
      3.3. A significant portion of the buffer (e.g., 50% or more) is 
referenced — in such a case, it may be better to preserve the original data 
buffer, unless we're specifically trying to reduce memory usage.
   
   Additionally, there's another concern: it might be useful to expose a public 
API method to indicate whether calling `CompactArray` would be beneficial.
   
   What do you think? Should I integrate this heuristic into `CompactArray`, or 
would it be better to split it into a separate method or even into a different 
pull request?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to