friendlymatthew opened a new pull request, #7922:
URL: https://github.com/apache/arrow-rs/pull/7922

   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/pull/7896
   
   # Rationale for this change
   
   In https://github.com/apache/arrow-rs/pull/7896, we saw that inserting a 
large number of field names takes a long time -- in this case ~45s to insert 
2**24 field names. The bulk of this time is spent just allocating the strings, 
but we also see quite a bit of time spent reallocating the `IndexSet` that 
we're inserting into. 
   
   `with_field_names` is an optimization that declares the field names upfront, 
which avoids having to reallocate and rehash the entire `IndexSet` during field 
name insertion. Using this method requires at least 2 string allocations for 
each field name -- 1 to declare the field name upfront and 1 to insert the actual 
field name during object building.
   
   This PR adds a new method, `with_field_name_capacity`, which allows you to 
reserve space in the metadata builder without needing to allocate the field 
names themselves upfront. In this case, we see a modest performance improvement 
when inserting the field names during object building.
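
   As a rough illustration of why reserving capacity helps, here is a minimal, dependency-free sketch. It is not the code in this PR: `FieldNames` and `count_growths` are hypothetical names, and a std `HashSet` stands in for the `IndexSet` used by the metadata builder. It counts how many times the table's capacity changes (each change implies a reallocation and rehash) while inserting distinct field names.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the metadata builder's field-name storage.
/// The PR uses an `IndexSet`; a std `HashSet` is assumed here only to keep
/// the sketch free of external crates.
pub struct FieldNames {
    names: HashSet<String>,
}

impl FieldNames {
    /// Starts empty, so the table has to grow repeatedly as names arrive.
    pub fn new() -> Self {
        Self { names: HashSet::new() }
    }

    /// Reserves room for `capacity` names without allocating any strings.
    pub fn with_field_name_capacity(capacity: usize) -> Self {
        Self { names: HashSet::with_capacity(capacity) }
    }

    /// Inserts a name, allocating a `String` only if it is not present yet.
    /// Returns `true` if the name was newly inserted.
    pub fn insert(&mut self, name: &str) -> bool {
        if self.names.contains(name) {
            return false;
        }
        self.names.insert(name.to_owned())
    }

    pub fn capacity(&self) -> usize {
        self.names.capacity()
    }

    pub fn len(&self) -> usize {
        self.names.len()
    }
}

/// Inserts `n` distinct names and returns how many times the table's
/// capacity changed along the way.
pub fn count_growths(mut table: FieldNames, n: usize) -> usize {
    let mut growths = 0;
    let mut cap = table.capacity();
    for i in 0..n {
        table.insert(&format!("field_{i}"));
        if table.capacity() != cap {
            growths += 1;
            cap = table.capacity();
        }
    }
    growths
}
```

   With the capacity reserved upfront, `count_growths(FieldNames::with_field_name_capacity(n), n)` should report no growth, whereas starting from `FieldNames::new()` grows the table several times. Each name is still allocated exactly once, at insertion time.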
   
   Before: 
   <img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08 43 PM" src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c" />
   
   
   After:
   <img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08 55 PM" src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3" />
   
   

