pitrou commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3022421404
> 1- API and Handling of the Last Buffer
> In [this pull request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s possible to [share buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90) without copying or finalizing the last buffer. This avoids [relocating the buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563) to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
>
> 2-
>
> > Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
>
> In [this pull request](https://github.com/apache/arrow/pull/46229), I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I think this metadata could help determine when CompactArray should be called.

Thanks for the reminder, and sorry that this is taking a long time :)

I propose that we review these PRs one by one. I've started with the `CompactArray` one and, once that is done, I would like to then move to the `AppendArraySlice` improvement. This PR here is slightly more contentious so I think we should tackle it only after the other APIs have settled semantics.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
