alamb commented on PR #6062: URL: https://github.com/apache/arrow-rs/pull/6062#issuecomment-2228883884
> Here is a quick benchmark, and the result looks reasonable.

I agree that this result looks reasonable, given that this PR doesn't have any StringView-specific optimizations. It is unfortunate, but not unexpected, that creating a `StringView` will be slower than creating a `StringArray` if the strings are copied.

> Some thoughts on reusing the buffer: CSV is row format, making it difficult to reuse the underlying buffer because we will likely hold the entire file in memory. So I think it makes sense to copy the strings to new place.

For some use cases (like streaming read + filter) I think it might make sense to reuse the buffers (the rationale being that the extra memory usage would last only a short time, and many of the rows are likely to be filtered out). Users could always call https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc to copy / compact the strings if desired (or simply read as StringView).

For this PR I think starting simple is good, and we can file a ticket to optimize the implementation later.
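To make the buffer-reuse trade-off concrete, here is a minimal std-only sketch. It is a toy model, not the arrow-rs implementation: `ToyViews` and all of its methods are hypothetical names invented for illustration. Each "view" is an (offset, length) pair into one shared buffer, so filtering keeps the whole buffer alive until a `gc`-style compaction copies out only the bytes still referenced.

```rust
use std::sync::Arc;

/// Toy stand-in for a string-view array (hypothetical, not arrow's type):
/// every view is an (offset, len) pair into one shared byte buffer.
struct ToyViews {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ToyViews {
    /// Build views over `buffer`, split on `sep`, without copying string bytes.
    /// `sep` is assumed to be an ASCII byte, so fields stay valid UTF-8.
    fn from_fields(buffer: Arc<Vec<u8>>, sep: u8) -> Self {
        let mut views = Vec::new();
        let mut start = 0;
        for (i, &b) in buffer.iter().enumerate() {
            if b == sep {
                views.push((start, i - start));
                start = i + 1;
            }
        }
        if start < buffer.len() {
            views.push((start, buffer.len() - start));
        }
        ToyViews { buffer, views }
    }

    /// Number of strings currently referenced.
    fn len(&self) -> usize {
        self.views.len()
    }

    /// Borrow the i-th string directly out of the shared buffer.
    fn get(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.buffer[off..off + len]).expect("valid UTF-8")
    }

    /// Keep only matching rows. The buffer is shared, not copied, so the
    /// filtered result still pins every byte of the original input.
    fn filter(&self, keep: impl Fn(&str) -> bool) -> Self {
        let views = (0..self.len())
            .filter(|&i| keep(self.get(i)))
            .map(|i| self.views[i])
            .collect();
        ToyViews { buffer: Arc::clone(&self.buffer), views }
    }

    /// Compact: copy only the referenced bytes into a fresh buffer,
    /// which lets the original buffer be freed once nothing else holds it.
    fn gc(&self) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(off, len) in &self.views {
            views.push((data.len(), len));
            data.extend_from_slice(&self.buffer[off..off + len]);
        }
        ToyViews { buffer: Arc::new(data), views }
    }

    /// Size of the buffer this value keeps alive.
    fn retained_bytes(&self) -> usize {
        self.buffer.len()
    }
}
```

In this model, building views over `b"alpha,beta,gamma,delta"` and filtering down to the single row `gamma` still retains all 22 input bytes; calling `gc()` afterwards retains only the 5 bytes that are actually referenced. That is the short-lived extra memory the comment describes, and the compaction step that bounds it.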
