yordan-pavlov edited a comment on pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#issuecomment-790979386


   @Dandandan I would be happy to collaborate on this;
   I have been using MS Visual Studio for profiling DataFusion and Arrow, and 
most of the time it works fairly well and gives useful insight.
   From what I have observed, after my change to push filters down to parquet (by filtering out entire row groups), about half the time is now spent in `ComplexObjectArrayReader::next_batch`; inside this method (see the sketch after the list below):
   * about 16% of total runtime is spent on `data_buffer.resize_with(batch_size, T::T::default);` where, in the case of `StringArray`, `data_buffer` is `Vec<ByteArray>`
   * about 8% of total time is spent on `data_buffer.truncate(num_read);`
   * about 10% of total time is spent in 
`data_buffer.into_iter().zip(self.def_levels_buffer.as_ref().unwrap().iter()).map(...).collect()`
   * another 10% of total time is spent in `let mut array = 
self.converter.convert(data)?;` - this is where the `Utf8ArrayConverter` is used
   * more time is also spent at the very end of the `next_batch` function, presumably deallocating the large number of `ByteArray` objects created
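
   To make the pattern concrete, here is a minimal, self-contained sketch of the shape of this hot path, together with the iterator-based alternative I experimented with. The names (`Value`, `num_read`, the functions) are hypothetical stand-ins for illustration, not the actual parquet crate code:

```rust
// Hypothetical stand-in for parquet's ByteArray; not the real type.
type Value = Vec<u8>;

// Shape of the current hot path: pre-allocate `batch_size` default values,
// let the decoder overwrite the first `num_read` slots, then drop the
// unused tail. Both `resize_with` and `truncate` show up as hotspots.
fn next_batch_current(batch_size: usize, num_read: usize) -> Vec<Value> {
    let mut data_buffer: Vec<Value> = Vec::new();
    data_buffer.resize_with(batch_size, Value::default); // ~16% of runtime
    // ... decoder writes `num_read` values into `data_buffer` here ...
    data_buffer.truncate(num_read); // ~8% of runtime
    data_buffer
}

// Iterator-based alternative: only materialize the values actually read,
// so there is no default construction or truncation of an unused tail.
fn next_batch_iterator(decoded: impl Iterator<Item = Value>, num_read: usize) -> Vec<Value> {
    decoded.take(num_read).collect()
}
```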
   
   I have managed to achieve about a 10-15% improvement by replacing `data_buffer.truncate(num_read)` with an iterator (as in the sketch above), but there is a lot more work to do. As I commented above, there are a number of improvements that could be made:
   * not converting into intermediate `ByteArray` objects on the way into a `StringArray` should result in a significant improvement (see the sketch after this list)
   * also, using iterators to fetch data loaded from parquet, instead of writing values into pre-allocated arrays, should avoid a lot of unnecessary allocation
   * for optimal performance, ideally do no intermediate conversion at all, and just copy byte slices from an iterator (over parquet data pages) into an arrow array
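
   As an illustration of the last two points, here is a hedged sketch, assuming a hypothetical iterator over decoded byte slices and a slice of definition levels, of building a `StringArray` directly, with no intermediate `ByteArray` objects (the real decoder plumbing would look different):

```rust
use arrow::array::StringArray;

// Sketch only: `values` yields the raw byte slices decoded from parquet
// data pages, and `def_levels` decides which slots are null. Both are
// hypothetical stand-ins for the real reader internals.
fn build_string_array<'a>(
    values: &mut impl Iterator<Item = &'a [u8]>,
    def_levels: &[i16],
    max_def_level: i16,
) -> StringArray {
    def_levels
        .iter()
        .map(|&level| {
            if level == max_def_level {
                // Copy the bytes straight into the arrow array's buffers,
                // skipping any intermediate owned ByteArray.
                values
                    .next()
                    .map(|bytes| std::str::from_utf8(bytes).expect("valid utf8"))
            } else {
                None // null slot
            }
        })
        .collect() // StringArray implements FromIterator<Option<&str>>
}
```

   Note this sketch still pays for per-value UTF-8 validation; going all the way would mean appending raw bytes and offsets into the arrow buffers directly.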
   
   All of these changes, I think, will propagate all the way down to the low-level decoders and will take time, but once done they should put us in a good position to transition to async using Streams (my understanding is that a Stream is effectively an async Iterator; see the sketch below).
   I am still figuring out exactly how to make this work and how to split the work into smaller pieces that can be done over time. I hope to make more progress on this in the next few days.
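
   For the Stream/Iterator analogy, here is a tiny sketch using the `futures` crate (illustration only, nothing DataFusion-specific): `Iterator::next` returns `Option<Item>` synchronously, while a Stream yields `Option<Item>` asynchronously, so consuming them looks almost identical.

```rust
use futures::stream::{Stream, StreamExt};

// Synchronous consumption of an Iterator.
fn sum_iter(it: impl Iterator<Item = i64>) -> i64 {
    let mut total = 0;
    for v in it {
        total += v;
    }
    total
}

// Asynchronous consumption of a Stream: the same loop, with `.await`.
async fn sum_stream(mut st: impl Stream<Item = i64> + Unpin) -> i64 {
    let mut total = 0;
    while let Some(v) = st.next().await {
        total += v;
    }
    total
}
```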

