yordan-pavlov commented on pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#issuecomment-787062425


   @nevi-me this probably deserves its own discussion, but you are pretty close 
with your suggestion to avoid `ByteArray`.
   
   I have been doing quite a lot of profiling and benchmarking over the past few 
weeks on loading string arrays from parquet files, and yes, going through an 
intermediate `Vec<ByteArray>` adds a lot of overhead due to extra allocation 
and, very interestingly, deallocation (I imagine because of the creation of many 
`ByteArray` objects pointing to the same `Arc`).
   
   I am happy to discuss this in more detail, but in short my benchmarks show 
that skipping the `ByteArray`s approximately doubles performance, replacing 
`Vec` with `Iterator` gives some further improvement, and finally removing the 
intermediate conversion into `&str` (and just copying the bytes instead) 
results in another doubling of performance, for a **total achievable 
performance improvement of about 5 times**. These are still very basic and 
isolated benchmarks; the next step is to find a way to apply these learnings to 
the actual parquet reader code.
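
   To make the comparison concrete, here is a rough, std-only sketch of the two 
extremes described above. It is not the parquet crate's code and not the 
benchmark itself: `SharedSlice` is a hypothetical stand-in for `ByteArray`, 
`StringColumn` is a hypothetical offsets-plus-values buffer in the style of an 
Arrow string array, and the input is assumed to be a simple length-prefixed 
layout (4-byte little-endian length, then the bytes).

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted byte slice
/// (illustrative only, not the parquet crate's `ByteArray`).
#[derive(Clone)]
struct SharedSlice {
    data: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }
}

/// Hypothetical Arrow-style string column: `offsets[i]..offsets[i + 1]`
/// is the byte range of value `i` inside `values`.
#[derive(Debug, PartialEq)]
struct StringColumn {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

/// Read the 4-byte little-endian length prefix at `pos`.
fn read_len(page: &[u8], pos: usize) -> usize {
    u32::from_le_bytes([page[pos], page[pos + 1], page[pos + 2], page[pos + 3]]) as usize
}

/// Slow path: one `SharedSlice` (and one `Arc` clone) per value, collected
/// into a `Vec`, then converted through `&str` before copying.
fn via_intermediate_vec(page: &Arc<Vec<u8>>) -> StringColumn {
    let mut slices = Vec::new();
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = read_len(page, pos);
        pos += 4;
        slices.push(SharedSlice { data: Arc::clone(page), start: pos, len });
        pos += len;
    }

    let mut col = StringColumn { offsets: vec![0], values: Vec::new() };
    for s in &slices {
        // Per-value UTF-8 validation via the intermediate `&str`.
        let text = std::str::from_utf8(s.as_bytes()).expect("valid UTF-8");
        col.values.extend_from_slice(text.as_bytes());
        col.offsets.push(col.values.len() as i32);
    }
    // Dropping `slices` here decrements the `Arc` once per value.
    col
}

/// Fast path: no per-value objects and no `&str` step; bytes are copied
/// straight into the values buffer. Note that this sketch performs no UTF-8
/// validation at all and panics on a truncated page.
fn direct_copy(page: &[u8]) -> StringColumn {
    let mut col = StringColumn {
        offsets: vec![0],
        values: Vec::with_capacity(page.len()),
    };
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = read_len(page, pos);
        pos += 4;
        col.values.extend_from_slice(&page[pos..pos + len]);
        col.offsets.push(col.values.len() as i32);
        pos += len;
    }
    col
}

/// Helper for building test input in the assumed length-prefixed layout.
fn encode(values: &[&str]) -> Vec<u8> {
    let mut page = Vec::new();
    for v in values {
        page.extend_from_slice(&(v.len() as u32).to_le_bytes());
        page.extend_from_slice(v.as_bytes());
    }
    page
}
```

Both functions should produce the same column for valid input; the difference 
is only in how many temporary objects and passes over the data are involved, 
which is where I would expect the allocation and deallocation cost to come from.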


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org