alamb commented on PR #7650:
URL: https://github.com/apache/arrow-rs/pull/7650#issuecomment-2984471796

   Ok, I figured out what is going on and why the `mixed_utf8view` test is 
slowing down. The issue is that the new utf8view code is triggering garbage 
collection (string copying) when the old one did not. I added some `println!` 
calls, and on main it shows
   
   ```
   ideal_buffer_size: 370022, actual_buffer_size: 614032
   ```
   This is just short of the cutoff (a load factor of 0.5) that would force a 
copy of the strings into new buffers
   
   However, on this branch, because the GC happens *after* the input is sliced, 
the load factor seen by the check is smaller, which triggers the GC in some cases
   
   
   ```
   ideal_buffer_size: 246034, actual_buffer_size: 614032
   ideal_buffer_size: 123988, actual_buffer_size: 614032
   ideal_buffer_size: 155553, actual_buffer_size: 614032
   Need GC
   ```
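   
   As a rough sketch of the arithmetic behind these numbers: the helper names 
`needs_gc` and `load_factor` below are made up for illustration and are not 
arrow-rs APIs; only the comparison itself is taken from the check in 
`byte_view.rs`. Note that the first two sizes logged on this branch 
(246034 + 123988) add up to the 370022 logged on main, which is consistent 
with one input being split into two slices that each fall below the cutoff.
   
   ```rust
   /// Hypothetical helper mirroring the existing check:
   /// copy the strings only if the buffers are more than twice the ideal size.
   fn needs_gc(ideal_buffer_size: usize, actual_buffer_size: usize) -> bool {
       actual_buffer_size > ideal_buffer_size * 2
   }

   /// Hypothetical helper: fraction of the buffer bytes that the views reference.
   fn load_factor(ideal_buffer_size: usize, actual_buffer_size: usize) -> f64 {
       ideal_buffer_size as f64 / actual_buffer_size as f64
   }

   fn main() {
       let actual = 614_032;

       // On main the check sees the whole input at once.
       println!(
           "main:  load factor {:.2}, needs_gc = {}",
           load_factor(370_022, actual),
           needs_gc(370_022, actual)
       );

       // On the branch the check sees each slice against the same buffers.
       for ideal in [246_034, 123_988, 155_553] {
           println!(
               "slice: load factor {:.2}, needs_gc = {}",
               load_factor(ideal, actual),
               needs_gc(ideal, actual)
           );
       }
   }
   ```
   
   The whole input stays above a load factor of 0.5, while each slice measured 
against the same 614032 bytes of buffers falls below it.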
   
   
   If I hard-code the GC heuristic to be different
   
   ```diff
   index 0be8702c1b..5e4695dd7e 100644
   --- a/arrow-select/src/coalesce/byte_view.rs
   +++ b/arrow-select/src/coalesce/byte_view.rs
   @@ -290,7 +290,7 @@ impl<B: ByteViewType> InProgressArray for InProgressByteViewArray<B> {
   
            // Copying the strings into a buffer can be time-consuming so
            // only do it if the array is sparse
   -        if actual_buffer_size > (ideal_buffer_size * 2) {
   +        if actual_buffer_size > (ideal_buffer_size * 100) {
                self.append_views_and_copy_strings(s.views(), ideal_buffer_size, buffers);
            } else {
                self.append_views_and_update_buffer_index(s.views(), buffers);
   ```
   
   The performance for this benchmark is the same as on main
   
   I am thinking about how best to fix this
   

