> Thanks. A follow-up question on pre-buffering. When the caching layer
> caches all the ranges, will they all issue requests to S3
> simultaneously to saturate S3 bandwidth? Or is there also a maximum
> download parallelism, or a pipelining technique?

All requests should be issued simultaneously to the I/O thread pool.
The I/O thread pool has a maximum parallelism (defaults to 8), which
is configurable.  More info at [1].
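To sketch the idea in stdlib-only Python (the real pool lives in Arrow's C++ I/O layer; `download_range` and the numbers here are illustrative, not Arrow's API): every range request is submitted at once, but a bounded worker pool caps how many are actually in flight.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an S3 GET of one byte range.
def download_range(start, length):
    return (start, length)

ranges = [(i * 1024, 1024) for i in range(32)]

# All 32 requests are submitted immediately, but at most 8
# (the default I/O thread pool size) run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(download_range, s, n) for s, n in ranges]
    results = [f.result() for f in futures]

print(len(results))  # 32
```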

> This question may be a bit of a dev detail. Will PreBuffer (aka Cache
> in the caching layer) only be effective if WhenBuffered (aka WaitFor
> in the caching layer) is called afterwards? I notice that in the old
> dataset implementation, only PreBuffer is called and no one waits for
> it.

This logic is governed by ReadRangeCache in src/arrow/io/caching.h.
What follows is my (potentially incorrect) understanding.  There are
two versions: a lazy one and an eager one.

The eager version will start the reads as soon as the ranges are
requested.  This results in faster reads and more likely parallel
reads, but ranges can only be coalesced within a single call to Cache
(this function takes a vector of ranges).  In this case there is no
need to call Wait/WaitFor to get parallelism.
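A simplified, stdlib-only sketch of the eager behavior (the 8 KiB hole and 64 MiB limits match the defaults discussed elsewhere in this thread, but the function names are illustrative, not the actual C++ API): coalescing happens within the one Cache call, and reads are kicked off immediately.

```python
from concurrent.futures import ThreadPoolExecutor

HOLE_SIZE = 8 * 1024          # gaps smaller than this are bridged
MAX_RANGE = 64 * 1024 * 1024  # coalesced ranges never exceed this

def coalesce(ranges):
    """Merge sorted (offset, length) ranges, closing small holes."""
    merged = []
    for off, length in sorted(ranges):
        if merged:
            p_off, p_len = merged[-1]
            gap = off - (p_off + p_len)
            new_len = off + length - p_off
            if gap < HOLE_SIZE and new_len <= MAX_RANGE:
                merged[-1] = (p_off, new_len)
                continue
        merged.append((off, length))
    return merged

pool = ThreadPoolExecutor(max_workers=8)

def cache_eager(ranges, read_fn):
    # Coalescing happens only within this single call; reads
    # start immediately, so no Wait is needed for parallelism.
    return [pool.submit(read_fn, off, ln) for off, ln in coalesce(ranges)]

futs = cache_eager([(0, 4096), (5000, 4096), (10_000_000, 4096)],
                   lambda off, ln: (off, ln))
# The 904-byte hole between the first two ranges is closed; the
# distant third range stays separate.
print([f.result() for f in futs])  # [(0, 9096), (10000000, 4096)]
```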

The lazy version will only start the read when it receives a call to
Read/Wait/WaitFor.  This version allows ranges to be coalesced across
different calls to Cache.  If a user calls Read without calling
Wait/WaitFor then they will get coalesced ranges but they will not
get parallel reads (unless multiple reads are required to satisfy the
single call to Read because it is larger than the maximum allowed
range, which defaults to 64MiB).
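The lazy path can be sketched the same way (again illustrative stdlib Python, not the real C++ implementation; this toy coalesce ignores the 64 MiB cap): Cache only records ranges, and the I/O starts when Wait is called.

```python
from concurrent.futures import ThreadPoolExecutor

HOLE_SIZE = 8 * 1024

def coalesce(ranges):
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) < HOLE_SIZE:
            merged[-1] = (merged[-1][0], off + length - merged[-1][0])
        else:
            merged.append((off, length))
    return merged

class LazyCache:
    """Illustrative lazy mode: no I/O happens until wait()."""
    def __init__(self, read_fn):
        self.read_fn = read_fn
        self.pending = []
        self.pool = ThreadPoolExecutor(max_workers=8)

    def cache(self, ranges):
        # Just remember the ranges; ranges from *different*
        # cache() calls can still be coalesced together later.
        self.pending.extend(ranges)

    def wait(self):
        # Coalesce everything accumulated so far, then read in parallel.
        merged = coalesce(self.pending)
        futs = [self.pool.submit(self.read_fn, o, n) for o, n in merged]
        return [f.result() for f in futs]

c = LazyCache(lambda off, ln: (off, ln))
c.cache([(0, 4096)])
c.cache([(5000, 4096)])  # a separate Cache call -- still coalesced
print(c.wait())          # [(0, 9096)]
```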

[1] https://arrow.apache.org/docs/dev/cpp/threading.html#thread-pools

On Wed, Apr 13, 2022 at 7:58 PM Xinyu Zeng <[email protected]> wrote:
>
> This question may be a little dev detail. Will PreBuffer(aka Cache in
> the caching layer) be effective if only there is WhenBuffered(aka
> WaitFor in caching layer) being called afterwards? I notice that in
> old dataset implementation, only PreBuffer is called but no one wait
> for it.
>
> On Thu, Apr 14, 2022 at 10:16 AM Xinyu Zeng <[email protected]> wrote:
> >
> > Thanks. A followup question on pre buffering. When the caching layer
> > caches all the ranges, will they all issue requests to S3
> > simultaneously to saturate S3 bandwidth? Or there is also a max of
> > parallelism downloading or pipelining technique?
> >
> > On Thu, Apr 14, 2022 at 4:51 AM Weston Pace <[email protected]> wrote:
> > >
> > > Yes, that matches my understanding as well.  I think, when
> > > pre-buffering is enabled, you might get parallel reads even if you are
> > > using ParquetFile/read_table but I would have to check.  I agree that
> > > it would be a good idea to add some documentation to all the readers
> > > going over our parallelism at a high level.  I created [1] and will
> > > try to update this when I get a chance.
> > >
> > > > I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> > > > ranges hurt parallelism? Or you can still parallelly read a huge range
> > > > after coalescing? To me, coalescing and parallel reading seem like a
> > > > tradeoff on S3?
> > >
> > > It's possible but I think there is a rather small range of files/reads
> > > that would be affected by this.  The coalescing will only close holes
> > > smaller than 8KiB and will only coalesce up to 64MiB.  Generally files
> > > are either larger than 64MiB or there are many files (in which case
> > > the I/O from a single file doesn't really need to be parallel).
> > > Furthermore, if we are not reading all of the columns then the gaps
> > > between columns are larger than 8KiB.
> > >
> > > We did benchmark pre buffering on S3 and, if I remember correctly, the
> > > pre buffering option had a very beneficial impact when running in S3.
> > > AWS recommends reads in the 8MB/16MB range and without pre-buffering I
> > > think our reads are too small to be effective.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-16194
> > >
> > > On Wed, Apr 13, 2022 at 3:16 AM Xinyu Zeng <[email protected]> wrote:
> > > >
> > > > I want to make sure a few of my understanding is correct in this
> > > > thread. There are two ways to read a parquet file in C++, either
> > > > through ParquetFile/read_table, or through ParquetDataset. For the
> > > > former, the parallelism is per column because read_table simply passes
> > > > all row groups indices to DecodeRowGroups in reader.cc, and there is
> > > > no row group level parallelism. For the latter, the parallelism is per
> > > > column and per row group, which is a ColumnChunk, according to
> > > > RowGroupGenerator in file_parquet.cc. The difference between the
> > > > former and the latter is also differentiated by use_legacy_dataset in
> > > > Python. If my understanding is correct, I think this difference may be
> > > > better explained in doc to avoid confusion. I have to crush the code
> > > > to understand.
> > > >
> > > > I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> > > > ranges hurt parallelism? Or you can still parallelly read a huge range
> > > > after coalescing? To me, coalescing and parallel reading seem like a
> > > > tradeoff on S3?
> > > >
> > > > Thanks in advance
