[GitHub] [arrow] yordan-pavlov commented on pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator

GitBox Sat, 27 Feb 2021 04:01:45 -0800


yordan-pavlov commented on pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#issuecomment-787062425



   @nevi-me this probably deserves its own discussion, but you are pretty close 
with your suggestion to avoid `ByteArray`.
   
   I have been doing quite a lot of profiling and benchmarking over the past few 
weeks on loading string arrays from parquet files, and yes, going through an 
intermediate `Vec<ByteArray>` adds a lot of overhead due to extra allocation 
and, very interestingly, deallocation (I imagine because of the creation of many 
`ByteArray` objects pointing to the same `Arc`).
   
   I am happy to discuss this in more detail, but in short my benchmarks show 
that skipping the `ByteArray`s approximately doubles performance. Replacing 
`Vec` with `Iterator` then gives some further improvement, and finally 
removing the intermediate conversion into `&str` (and just copying the bytes 
instead) doubles performance again, for a **total achievable performance 
improvement of about 5 times**. These are still very basic and isolated 
benchmarks; the next step is to find a way to apply these findings to the 
actual parquet reader code.
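
   To make the comparison above concrete, here is a minimal sketch of the two 
approaches using only the standard library. It is not the parquet reader code 
and not the benchmark code: `SharedSlice` is a hypothetical stand-in for a 
`ByteArray`-like value (a range into an `Arc`-shared buffer), and 
`build_via_intermediate` and `build_direct` are illustrative names. Both 
functions produce the offsets and values buffers that a string array is built 
from.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a `ByteArray`-like value: a range into a
/// reference-counted buffer. This is NOT the parquet crate's type.
#[derive(Clone)]
struct SharedSlice {
    data: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }
}

/// Slow path: materialize one `SharedSlice` per value (an `Arc` clone each,
/// plus the `Vec` allocation), convert each to `&str`, then copy the bytes.
fn build_via_intermediate(
    page: &Arc<Vec<u8>>,
    ranges: &[(usize, usize)],
) -> (Vec<i32>, Vec<u8>) {
    let slices: Vec<SharedSlice> = ranges
        .iter()
        .map(|&(start, len)| SharedSlice { data: Arc::clone(page), start, len })
        .collect();

    let mut offsets = vec![0i32];
    let mut values = Vec::new();
    for s in &slices {
        // Per-value UTF-8 validation: the intermediate `&str` conversion.
        let text = std::str::from_utf8(s.as_bytes()).expect("valid UTF-8");
        values.extend_from_slice(text.as_bytes());
        offsets.push(values.len() as i32);
    }
    (offsets, values)
    // `slices` is dropped here: one `Arc` decrement per value.
}

/// Direct path: consume the ranges lazily and copy raw bytes straight into
/// the values buffer. No per-value objects and no `&str` conversion.
/// Assumption: UTF-8 validity is guaranteed or checked elsewhere.
fn build_direct(
    page: &[u8],
    ranges: impl Iterator<Item = (usize, usize)>,
) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut values = Vec::new();
    for (start, len) in ranges {
        values.extend_from_slice(&page[start..start + len]);
        offsets.push(values.len() as i32);
    }
    (offsets, values)
}
```

   Both functions return the same buffers for the same input; the difference 
is only in how much intermediate work is done per value. Note that the direct 
path skips UTF-8 validation, so in a real reader that check would have to 
happen somewhere else (for example once over the whole values buffer).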


