yordan-pavlov commented on pull request #9588: URL: https://github.com/apache/arrow/pull/9588#issuecomment-787062425
@nevi-me this probably deserves its own discussion, but you are pretty close with your suggestion to avoid `ByteArray`. I have been doing quite a lot of profiling and benchmarking over the past few weeks on loading string arrays from parquet files, and yes, going through an intermediate `Vec<ByteArray>` adds a lot of overhead due to extra allocation and, very interestingly, deallocation (I imagine because of the creation of many `ByteArray` objects pointing to the same `Arc`).

I am happy to discuss this in more detail, but in short my benchmarks show that skipping the `ByteArray`s approximately doubles performance, replacing `Vec` with `Iterator` then gives some further improvement, and finally removing the intermediate conversion into `&str` (and just copying the bytes instead) doubles performance again, for a **total achievable performance improvement of about 5 times**. These are still very basic and isolated benchmarks; the next step is to find a way to apply these learnings to the actual parquet reader code.
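
To make the comparison concrete, here is a minimal, std-only sketch of the two ends of that spectrum. It is not the parquet reader code and not the benchmark code referred to above. `SharedBytes` is a hypothetical stand-in for `ByteArray` (a handle into a shared `Arc` buffer), and `StringColumn`, `PlainRanges`, `encode_plain`, `via_shared_handles` and `direct_copy` are names invented for illustration. The only format detail assumed is that values are stored as a 4-byte little-endian length followed by the bytes, as in Parquet PLAIN encoding of `BYTE_ARRAY`.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Arrow-style string layout: one contiguous value buffer plus offsets,
/// where offsets[i]..offsets[i + 1] delimits value i.
#[derive(Debug, PartialEq)]
pub struct StringColumn {
    pub offsets: Vec<i32>,
    pub values: Vec<u8>,
}

/// Hypothetical stand-in for `ByteArray`: a handle into a shared buffer.
/// Creating one bumps the Arc refcount; dropping one decrements it.
pub struct SharedBytes {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    fn as_slice(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Iterator over the byte ranges of length-prefixed values in a page.
/// It yields ranges only, so iterating allocates nothing.
pub struct PlainRanges<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Iterator for PlainRanges<'a> {
    type Item = Range<usize>;

    fn next(&mut self) -> Option<Range<usize>> {
        let start = self.pos.checked_add(4)?;
        let prefix = self.data.get(self.pos..start)?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        let end = start.checked_add(len)?;
        if end > self.data.len() {
            return None; // truncated page: stop rather than panic
        }
        self.pos = end;
        Some(start..end)
    }
}

/// Helper that builds a demo page from string values.
pub fn encode_plain(values: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v.as_bytes());
    }
    out
}

/// Slow path: materialize a Vec of shared handles, convert each to &str
/// (UTF-8 validation per value), then copy into the column.
pub fn via_shared_handles(page: &Arc<[u8]>) -> StringColumn {
    let ranges = PlainRanges { data: &page[..], pos: 0 };
    let handles: Vec<SharedBytes> = ranges
        .map(|range| SharedBytes { buf: Arc::clone(page), range })
        .collect();
    let mut col = StringColumn { offsets: vec![0], values: Vec::new() };
    for h in &handles {
        let s = std::str::from_utf8(h.as_slice()).expect("valid UTF-8");
        col.values.extend_from_slice(s.as_bytes());
        col.offsets.push(col.values.len() as i32);
    }
    col
    // `handles` is dropped here: one refcount decrement per value.
}

/// Fast path: iterate over ranges and copy the bytes straight into the
/// value buffer. No intermediate Vec, no per-value handle, no &str step.
pub fn direct_copy(page: &[u8]) -> StringColumn {
    let mut col = StringColumn {
        offsets: vec![0],
        values: Vec::with_capacity(page.len()),
    };
    let ranges = PlainRanges { data: page, pos: 0 };
    for range in ranges {
        col.values.extend_from_slice(&page[range]);
        col.offsets.push(col.values.len() as i32);
    }
    col
}
```

Both functions produce the same offsets and value buffer; they differ only in the intermediate work. Note that `direct_copy` performs no UTF-8 validation at all, so a real reader would have to decide separately whether and where validation is still required. The sketch makes no claim about the speedups themselves, which come from the benchmarks described above.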