This question may be a bit of a dev detail. Is PreBuffer (aka Cache in the caching layer) only effective if WhenBuffered (aka WaitFor in the caching layer) is called afterwards? I notice that in the old dataset implementation, only PreBuffer is called and nothing waits for it.
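For intuition on why a fire-and-forget prefetch can still pay off even when nothing explicitly waits on it: later reads block on futures whose downloads already started, so the latency overlaps with other work. This is only a toy model, not the actual Arrow internals; `RangeCache`, `prefetch`, and `read` are made-up names:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class RangeCache:
    """Toy read-range cache: prefetch() starts downloads eagerly and
    waits for none of them; read() blocks on the already-started future."""

    def __init__(self, fetch, max_workers=4):
        self._fetch = fetch          # fetch(offset, length) -> bytes
        self._pool = ThreadPoolExecutor(max_workers)
        self._futures = {}

    def prefetch(self, ranges):
        # Fire-and-forget: kick off every download, don't wait.
        for offset, length in ranges:
            self._futures[(offset, length)] = self._pool.submit(
                self._fetch, offset, length)

    def read(self, offset, length):
        fut = self._futures.get((offset, length))
        if fut is not None:
            return fut.result()      # blocks only if not finished yet
        return self._fetch(offset, length)  # cache miss: synchronous read

def slow_fetch(offset, length):
    time.sleep(0.05)                 # simulate object-store latency
    return bytes(range(offset, offset + length))

cache = RangeCache(slow_fetch)
cache.prefetch([(0, 4), (10, 4)])    # nobody ever waits on these...
assert cache.read(0, 4) == b"\x00\x01\x02\x03"  # ...yet reads are served
```

So even with no explicit wait, the prefetch turned a sequence of serial blocking reads into downloads that started as early as possible.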
On Thu, Apr 14, 2022 at 10:16 AM Xinyu Zeng <[email protected]> wrote:
>
> Thanks. A follow-up question on pre-buffering. When the caching layer
> caches all the ranges, will they all issue requests to S3
> simultaneously to saturate S3 bandwidth? Or is there also a maximum
> download parallelism or a pipelining technique?
>
> On Thu, Apr 14, 2022 at 4:51 AM Weston Pace <[email protected]> wrote:
> >
> > Yes, that matches my understanding as well. I think, when
> > pre-buffering is enabled, you might get parallel reads even if you
> > are using ParquetFile/read_table, but I would have to check. I agree
> > that it would be a good idea to add some documentation to all the
> > readers going over our parallelism at a high level. I created [1]
> > and will try to update this when I get a chance.
> >
> > > I was also wondering how pre_buffer works. Will coalescing
> > > ColumnChunk ranges hurt parallelism? Or can you still read a huge
> > > range in parallel after coalescing? To me, coalescing and parallel
> > > reading seem like a tradeoff on S3?
> >
> > It's possible, but I think there is a rather small range of
> > files/reads that would be affected by this. The coalescing will only
> > close holes smaller than 8KiB and will only coalesce up to 64MiB.
> > Generally files are either larger than 64MiB or there are many files
> > (in which case the I/O from a single file doesn't really need to be
> > parallel). Furthermore, if we are not reading all of the columns,
> > then the gaps between columns are larger than 8KiB.
> >
> > We did benchmark pre-buffering on S3 and, if I remember correctly,
> > the pre-buffering option had a very beneficial impact when running
> > on S3. AWS recommends reads in the 8MB/16MB range, and without
> > pre-buffering I think our reads are too small to be effective.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-16194
> >
> > On Wed, Apr 13, 2022 at 3:16 AM Xinyu Zeng <[email protected]> wrote:
> > >
> > > I want to make sure a few points of my understanding in this thread
> > > are correct. There are two ways to read a Parquet file in C++:
> > > either through ParquetFile/read_table, or through ParquetDataset.
> > > For the former, the parallelism is per column, because read_table
> > > simply passes all row group indices to DecodeRowGroups in
> > > reader.cc, and there is no row-group-level parallelism. For the
> > > latter, the parallelism is per column and per row group, i.e. per
> > > ColumnChunk, according to RowGroupGenerator in file_parquet.cc.
> > > The difference between the former and the latter also corresponds
> > > to use_legacy_dataset in Python. If my understanding is correct, I
> > > think this difference would be better explained in the docs to
> > > avoid confusion; I had to dig through the code to understand it.
> > >
> > > I was also wondering how pre_buffer works. Will coalescing
> > > ColumnChunk ranges hurt parallelism? Or can you still read a huge
> > > range in parallel after coalescing? To me, coalescing and parallel
> > > reading seem like a tradeoff on S3?
> > >
> > > Thanks in advance
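The 8KiB-hole / 64MiB limits quoted above can be illustrated with a small sketch of range coalescing. This is a simplified model, not Arrow's actual read-range cache code; the constants simply mirror the defaults mentioned in the thread:

```python
HOLE_SIZE_LIMIT = 8 * 1024           # only close holes smaller than 8 KiB
RANGE_SIZE_LIMIT = 64 * 1024 * 1024  # never coalesce beyond 64 MiB

def coalesce(ranges, hole_limit=HOLE_SIZE_LIMIT, size_limit=RANGE_SIZE_LIMIT):
    """Merge (offset, length) read ranges whose gaps are below hole_limit,
    without letting any merged range exceed size_limit."""
    out = []
    for offset, length in sorted(ranges):
        if out:
            prev_off, prev_len = out[-1]
            hole = offset - (prev_off + prev_len)
            merged_len = offset + length - prev_off
            if hole < hole_limit and merged_len <= size_limit:
                out[-1] = (prev_off, merged_len)  # coalesce into one request
                continue
        out.append((offset, length))
    return out

# Two chunks 4 KiB apart are merged into one request; a ~1 MiB gap
# (e.g. a column we skip) keeps the reads separate.
assert coalesce([(0, 100), (4196, 100)]) == [(0, 4296)]
assert coalesce([(0, 100), (1_048_676, 100)]) == [(0, 100), (1_048_676, 100)]
```

This also shows why the tradeoff with parallelism is usually small: skipped columns leave gaps larger than 8 KiB (so their reads stay separate and parallel), and a merged range can never grow past 64 MiB anyway.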
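As a toy illustration of the parallelism difference described above (not the actual reader code; the function names are hypothetical): fanning work out per column only, versus per (row group, column) pair, changes how many independent tasks the scheduler can run at once.

```python
def read_table_tasks(num_row_groups, columns):
    # ParquetFile/read_table style: one task per column; each task
    # decodes that column across all row groups sequentially.
    return [("column", col) for col in columns]

def dataset_tasks(num_row_groups, columns):
    # ParquetDataset style: one task per (row group, column),
    # i.e. one task per ColumnChunk.
    return [("chunk", rg, col)
            for rg in range(num_row_groups)
            for col in columns]

# 8 row groups x 3 columns: 3 units of parallel work vs 24.
assert len(read_table_tasks(8, ["a", "b", "c"])) == 3
assert len(dataset_tasks(8, ["a", "b", "c"])) == 24
```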
