andishgar commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3016821941

   @mapleFU @pitrou
   I believe this pull request is related to several other PRs I've submitted. Here's a summary:
   
   1- API and Handling of the Last Buffer
   In [this pull request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s possible to [share buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90) without copying or finalizing the last buffer. This avoids [relocating the buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563) to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
   
   2-
   >Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
   
   In [this pull request](https://github.com/apache/arrow/pull/46229), I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I think this metadata could help determine when CompactArray should be called.
   
   Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. Whether a newly created array should be compacted could then be decided from some metadata, in order to avoid memory bloat.
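
   As a rough illustration of a metadata-based compaction decision, here is a minimal sketch. Everything in it is an assumption made for illustration: `SharedViewStats`, `ReferencedBytes`, `HeldBytes`, `ShouldCompact` and the 0.5 threshold are hypothetical and are not part of Arrow's API or of the linked PRs. It relies only on the fact that view values of at most 12 bytes are stored inline and reference no data buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical metadata about a string-view array whose data buffers are
// shared with a builder or with other arrays. This is NOT an Arrow type.
struct SharedViewStats {
  std::vector<int32_t> value_lengths;      // length in bytes of each value
  std::vector<int64_t> data_buffer_sizes;  // size in bytes of each shared data buffer
};

// Values of at most 12 bytes live inline in the 16-byte view itself.
constexpr int32_t kInlineSize = 12;

// Bytes of the shared data buffers referenced by the views.
// Assumption: views do not overlap, so summing lengths is exact;
// with overlapping views this is only an upper bound.
inline int64_t ReferencedBytes(const SharedViewStats& stats) {
  int64_t total = 0;
  for (int32_t len : stats.value_lengths) {
    if (len > kInlineSize) total += len;
  }
  return total;
}

// Total bytes kept alive by holding on to the shared data buffers.
inline int64_t HeldBytes(const SharedViewStats& stats) {
  int64_t total = 0;
  for (int64_t size : stats.data_buffer_sizes) total += size;
  return total;
}

// Hypothetical policy: compact when less than `min_utilization` of the held
// bytes are actually referenced. The default of 0.5 is an arbitrary choice.
inline bool ShouldCompact(const SharedViewStats& stats,
                          double min_utilization = 0.5) {
  assert(min_utilization >= 0.0 && min_utilization <= 1.0);
  const int64_t held = HeldBytes(stats);
  if (held == 0) return false;  // nothing held, nothing to reclaim
  const int64_t referenced = ReferencedBytes(stats);
  return static_cast<double>(referenced) <
         min_utilization * static_cast<double>(held);
}
```

   With a policy like this, an array made up mostly of short (inline) strings that still holds a large shared buffer would be flagged for compaction, while an array that references most of the bytes it holds would keep sharing its buffers without a copy.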


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
