andishgar commented on code in PR #46229: URL: https://github.com/apache/arrow/pull/46229#discussion_r2176905950
########## cpp/src/arrow/array/array_binary.cc: ########## @@ -104,7 +105,14 @@ BinaryViewArray::BinaryViewArray(std::shared_ptr<DataType> type, int64_t length, SetData( ArrayData::Make(std::move(type), length, std::move(buffers), null_count, offset)); } - +Result<std::shared_ptr<Array>> BinaryViewArray::CompactArray(MemoryPool* pool) const { + std::unique_ptr<ArrayBuilder> builder; + ARROW_RETURN_NOT_OK(MakeBuilder(pool, type(), &builder)); + ArraySpan span(*data()); + // ArraySpan handles offset + ARROW_RETURN_NOT_OK(builder->AppendArraySlice(span, 0, span.length)); Review Comment: I'm currently reviewing the implementation, and I think we should consider making some changes: 1. **Bitmap Buffer**: It's possible to avoid copying the bitmap buffer entirely. 2. **View Buffer**: For the view buffer, we likely need to make a copy, since it may already be shared due to operations like `cast`, `arrow::Array::View`, or `arrow::ViewOrCopyTo`. However, I wonder if we could use `std::shared_ptr::use_count()` to detect whether the array is the sole owner of the view buffer and skip the copy in that case. 3. **Data Buffer**: Three scenarios come to mind: 3.1. The buffer is entirely empty — in such cases, we can safely discard it. This might occur after calling a compute function like `extract_regex` (note: this is a hypothetical example, as compute functions don’t support `StringView` yet). 3.2. Only a small portion of the data buffer is used — for instance, if only one-tenth of the buffer is referenced by the view. In this case, copying the used portion to a new buffer helps prevent memory bloat. 3.3. A significant portion of the buffer (e.g., 50% or more) is referenced — in such a case, it may be better to preserve the original data buffer, unless we're specifically trying to reduce memory usage. Additionally, there's another concern: it might be useful to expose a public API method to indicate whether calling `CompactArray` would be beneficial. What do you think? Should I integrate this heuristic into `CompactArray`, or would it be better to split it into a separate method or even into a different pull request? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org