alamb commented on PR #7650: URL: https://github.com/apache/arrow-rs/pull/7650#issuecomment-2984471796
Ok, I figured out what is going on and why the `mixed_utf8view` test is slowing down.

The issue is that the new utf8view code triggers garbage collection (string copying) in cases where the old code did not.

I added some `println!` calls, and on main it shows

```
ideal_buffer_size: 370022, actual_buffer_size: 614032
```

This is right under the cutoff load factor (0.5) that would force a copy of the strings into new buffers.

However, on this branch, because the GC happens *after* the input is sliced, the load factor of each slice is smaller, which triggers the GC in some cases

```
ideal_buffer_size: 246034, actual_buffer_size: 614032
ideal_buffer_size: 123988, actual_buffer_size: 614032
ideal_buffer_size: 155553, actual_buffer_size: 614032
Need GC
```

If I hard code the GC heuristic to be different

```diff
index 0be8702c1b..5e4695dd7e 100644
--- a/arrow-select/src/coalesce/byte_view.rs
+++ b/arrow-select/src/coalesce/byte_view.rs
@@ -290,7 +290,7 @@ impl<B: ByteViewType> InProgressArray for InProgressByteViewArray<B> {
         // Copying the strings into a buffer can be time-consuming so
         // only do it if the array is sparse
-        if actual_buffer_size > (ideal_buffer_size * 2) {
+        if actual_buffer_size > (ideal_buffer_size * 100) {
             self.append_views_and_copy_strings(s.views(), ideal_buffer_size, buffers);
         } else {
             self.append_views_and_update_buffer_index(s.views(), buffers);
```

The performance for this benchmark is the same as on main.

I am thinking about how best to fix this.
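To make the arithmetic concrete, here is a minimal sketch that replays the comparison from the diff against the sizes printed above. The helper names `needs_gc` and `load_factor` and the `factor` parameter are hypothetical and exist only for illustration; the real check is inline in `arrow-select/src/coalesce/byte_view.rs`, and this sketch does not model where the `println!` calls were placed.

```rust
/// Hypothetical helper (not part of arrow-rs): mirrors the comparison
/// `actual_buffer_size > ideal_buffer_size * factor` from the diff.
/// `factor` is 2 in the original code and 100 in the experiment.
pub fn needs_gc(ideal_buffer_size: usize, actual_buffer_size: usize, factor: usize) -> bool {
    // saturating_mul avoids overflow for very large buffers
    actual_buffer_size > ideal_buffer_size.saturating_mul(factor)
}

/// Hypothetical helper: fraction of the existing buffers that the views
/// actually reference. A factor of 2 corresponds to a cutoff of 0.5.
pub fn load_factor(ideal_buffer_size: usize, actual_buffer_size: usize) -> f64 {
    ideal_buffer_size as f64 / actual_buffer_size as f64
}
```

With `factor = 2`, the unsliced size seen on main (370022 of 614032, a load factor of roughly 0.60) does not meet the copy condition, while each of the sliced sizes (246034, 123988, 155553, load factors of roughly 0.40, 0.20, and 0.25) does. With `factor = 100`, none of them do, which is consistent with the benchmark matching main after the hard-coded change.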