The parquet::arrow::FileReader class takes a parquet::ArrowReaderProperties, which has a use_threads option. If true, the reader will parallelize column reads. This flag is used in parquet/arrow/reader.cc to parallelize column reads (search for OptionalParallelFor).
This may or may not trigger the actual reading. If prebuffering is off, then a NextBatch call triggers the read needed for the column. However, if prebuffering is on (best for performance), an attempt will be made to coalesce reads following the rules in src/arrow/io/caching.h. This might be a fun place to do some experiments if you'd like. In src/arrow/io/caching.cc you will see the actual calls to arrow::io::RandomAccessFile::ReadAsync.

Keep in mind this is all with regard to the latest commits; some work has been done here since 3.0. Also keep in mind that the use_threads flag is forced off when reading multiple files as part of a dataset scan. This happens in arrow::dataset::MakeArrowReaderProperties inside file_parquet.cc. I am currently working on ARROW-7001, which will allow us to keep the parallel reads; that JIRA issue explains the problems involved. I'm happy to provide more information if you'd like, but I hope this gets you started.

On Tue, Mar 16, 2021 at 8:00 AM Yeshwanth Sriram <[email protected]> wrote:
>
> Hello,
>
> I’ve managed to implement an ADLFS/gen2 filesystem with readers/writers. I’m
> also able to read data from ADLFS via the parquet reader using my
> implementation. It is modeled on the s3fs implementation.
>
> Questions:
> - Is there a way to parallelize the column read operation using multiple
>   threads in parquet/reader?
> - Can someone point to the code in the parquet subsystem where the final
>   call is dispatched to the underlying random access file object?
>
> Thank you
> Yesh
