The parquet::arrow::FileReader class takes a
parquet::ArrowReaderProperties, which has a use_threads option.  If
true, the reader will parallelize column reads.  This flag is used
in parquet/arrow/reader.cc to parallelize column reads (search for
OptionalParallelFor).

This may or may not trigger the actual reading.  If prebuffering is
off, then there is a NextBatch call which triggers the read needed
for the column.

However, if prebuffering is on (best for performance), then an attempt
will be made to coalesce reads following the rules in
src/arrow/io/caching.h.  This might be a fun place to do some
experiments if you'd like.  In src/arrow/io/caching.cc you will see
the actual calls to arrow::io::RandomAccessFile::ReadAsync.  Keep in
mind this all refers to the latest commits; some work has been done
here since 3.0.

Also keep in mind that the use_threads flag is forced off when reading
multiple files as part of a dataset scan.  This happens in
arrow::dataset::MakeArrowReaderProperties inside of file_parquet.cc.
I am currently working on ARROW-7001, which will allow us to keep the
parallel reads; that JIRA issue explains the challenges involved.

I'm happy to provide more information if you'd like but I hope this
gets you started.

On Tue, Mar 16, 2021 at 8:00 AM Yeshwanth Sriram <[email protected]> wrote:
>
> Hello,
>
> I’ve managed to implement ADLFS/gen2 filesystem with reader/writers. I’m also 
> able to read through data from ADLFS via parquet reader using my 
> implementation. It is modeled like the s3fs implementation.
>
> Question.
> - Is way to parallelize the column read operation using multiple threads in 
> parquet/reader.
> - Can someone point to code in parquet subsystem where the final call is 
> dispatched to the underlying random access file object.
>
> Thank you
> Yesh
