[GitHub] [arrow] yordan-pavlov commented on pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator

2021-02-27 Thread GitBox


yordan-pavlov commented on pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#issuecomment-787062425


   @nevi-me this probably deserves its own discussion, but you are pretty close 
with your suggestion to avoid ByteArray;
   
   I have been doing quite a lot of profiling and benchmarking in the past few 
weeks on loading string arrays from parquet files, and yes, going through an 
intermediate `Vec` adds a lot of overhead due to extra allocation 
and very interestingly, deallocation (I imagine because of the creation of many 
`ByteArray` objects pointing to the same Arc).
   
   I am happy to discuss this in more detail, but in short my benchmarks show 
that skipping the `ByteArray`s approximately doubles performance, then 
replacing `Vec` with `Iterator` results in some more improvement and finally 
removing the intermediate conversion into `&str` (and just copying the bytes 
instead) results in another doubling of performance for a **total achievable 
performance improvement of about 5 times**. These are still very basic and 
isolated benchmarks, next step is to find a way to apply these learnings to the 
actual parquet reader code.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on pull request #9588: ARROW-11799: [Rust] fix len of string and binary arrays created from unbound iterator

2021-03-04 Thread GitBox


yordan-pavlov commented on pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#issuecomment-790979386


   @Dandandan I would be happy to collaborate on this;
   I have been using MS Visual Studio for profiling DataFusion and Arrow, and 
most of the time it works fairly well and gives useful insight.
   From what I have observed, after my change to push filers down to parquet 
(by filtering out entire row groups), now about half the time is spent in 
`ComplexObjectArrayReader::next_batch`; then inside this method
   * about 16% of total runtime is spent on 
`data_buffer.resize_with(batch_size, T::T::default);` where in the case of 
`StringArray`, `data_bufer` is `Vec` 
   * about 8% of total time is spent on `data_buffer.truncate(num_read);`
   * about 10% of total time is spent in 
`data_buffer.into_iter().zip(self.def_levels_buffer.as_ref().unwrap().iter()).map(...).collect()`
   * another 10% of total time is spent in `let mut array = 
self.converter.convert(data)?;` - this is where the `Utf8ArrayConverter` is used
   * more time is also spent at the very end of the `next_batch` function, I 
imagine to deallocate the large number of `ByteArray` objects created
   
   I have managed to achieve about 10-15% improvement by replacing 
`data_buffer.truncate(num_read)` with using an iterator instead, but there is a 
lot more work to do. As I commented above, there is a number of improvements 
that could be done:
   * not converting into intermediate `ByteArray` objects on the way into a 
`StringArray` should result in a significant improvement
   * also using iterators to fetch data loaded from parquet, instead of writing 
values in pre-allocated arrays should avoid a lot of unnecessary allocation
   * for optimal performance, ideally not do any intermediate conversion at 
all, and just copy bytes slices from an iterator (over parquet data pages) into 
an arrow array
   
   All of these changes, I think, will propagate all the way to the low-level 
decoders and will take time, but when done should put us in a good position to 
transition to async using Streams (my understanding is that a Stream is 
effectively an async Iterator).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org