This question may be a bit of a dev detail. Is PreBuffer (aka Cache in the caching layer) only effective if WhenBuffered (aka WaitFor in the caching layer) is called afterwards? I notice that in the old dataset implementation, only PreBuffer is called and nothing waits for it.
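For intuition on why a fire-and-forget prefetch can still pay off even when nothing explicitly waits on it: later reads block on futures whose downloads already started, so the latency overlaps with other work. This is only a toy model, not the actual Arrow internals; `RangeCache`, `prefetch`, and `read` are made-up names:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class RangeCache:
    """Toy read-range cache: prefetch() starts downloads eagerly and
    waits for none of them; read() blocks on the already-started future."""

    def __init__(self, fetch, max_workers=4):
        self._fetch = fetch          # fetch(offset, length) -> bytes
        self._pool = ThreadPoolExecutor(max_workers)
        self._futures = {}

    def prefetch(self, ranges):
        # Fire-and-forget: kick off every download, don't wait.
        for offset, length in ranges:
            self._futures[(offset, length)] = self._pool.submit(
                self._fetch, offset, length)

    def read(self, offset, length):
        fut = self._futures.get((offset, length))
        if fut is not None:
            return fut.result()      # blocks only if not finished yet
        return self._fetch(offset, length)  # cache miss: synchronous read

def slow_fetch(offset, length):
    time.sleep(0.05)                 # simulate object-store latency
    return bytes(range(offset, offset + length))

cache = RangeCache(slow_fetch)
cache.prefetch([(0, 4), (10, 4)])    # nobody ever waits on these...
assert cache.read(0, 4) == b"\x00\x01\x02\x03"  # ...yet reads are served
```

So even with no explicit wait, the prefetch turned a sequence of serial blocking reads into downloads that started as early as possible.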
On Thu, Apr 14, 2022 at 10:16 AM Xinyu Zeng <[email protected]> wrote:
>
> Thanks. A follow-up question on pre-buffering. When the caching layer
> caches all the ranges, will they all issue requests to S3
> simultaneously to saturate S3 bandwidth? Or is there also a maximum
> download parallelism or a pipelining technique?
>
> On Thu, Apr 14, 2022 at 4:51 AM Weston Pace <[email protected]> wrote:
> >
> > Yes, that matches my understanding as well. I think, when
> > pre-buffering is enabled, you might get parallel reads even if you
> > are using ParquetFile/read_table, but I would have to check. I agree
> > that it would be a good idea to add some documentation to all the
> > readers going over our parallelism at a high level. I created [1]
> > and will try to update this when I get a chance.
> >
> > > I was also wondering how pre_buffer works. Will coalescing
> > > ColumnChunk ranges hurt parallelism? Or can you still read a huge
> > > range in parallel after coalescing? To me, coalescing and parallel
> > > reading seem like a tradeoff on S3?
> >
> > It's possible, but I think there is a rather small range of
> > files/reads that would be affected by this. The coalescing will only
> > close holes smaller than 8KiB and will only coalesce up to 64MiB.
> > Generally files are either larger than 64MiB or there are many files
> > (in which case the I/O from a single file doesn't really need to be
> > parallel). Furthermore, if we are not reading all of the columns,
> > then the gaps between columns are larger than 8KiB.
> >
> > We did benchmark pre-buffering on S3 and, if I remember correctly,
> > the pre-buffering option had a very beneficial impact when running
> > on S3. AWS recommends reads in the 8MB/16MB range, and without
> > pre-buffering I think our reads are too small to be effective.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-16194
> >
> > On Wed, Apr 13, 2022 at 3:16 AM Xinyu Zeng <[email protected]> wrote:
> > >
> > > I want to make sure a few points of my understanding in this thread
> > > are correct. There are two ways to read a Parquet file in C++:
> > > either through ParquetFile/read_table, or through ParquetDataset.
> > > For the former, the parallelism is per column, because read_table
> > > simply passes all row group indices to DecodeRowGroups in
> > > reader.cc, and there is no row-group-level parallelism. For the
> > > latter, the parallelism is per column and per row group, i.e. per
> > > ColumnChunk, according to RowGroupGenerator in file_parquet.cc.
> > > The difference between the former and the latter also corresponds
> > > to use_legacy_dataset in Python. If my understanding is correct, I
> > > think this difference would be better explained in the docs to
> > > avoid confusion; I had to dig through the code to understand it.
> > >
> > > I was also wondering how pre_buffer works. Will coalescing
> > > ColumnChunk ranges hurt parallelism? Or can you still read a huge
> > > range in parallel after coalescing? To me, coalescing and parallel
> > > reading seem like a tradeoff on S3?
> > >
> > > Thanks in advance
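The 8KiB-hole / 64MiB limits quoted above can be illustrated with a small sketch of range coalescing. This is a simplified model, not Arrow's actual read-range cache code; the constants simply mirror the defaults mentioned in the thread:

```python
HOLE_SIZE_LIMIT = 8 * 1024           # only close holes smaller than 8 KiB
RANGE_SIZE_LIMIT = 64 * 1024 * 1024  # never coalesce beyond 64 MiB

def coalesce(ranges, hole_limit=HOLE_SIZE_LIMIT, size_limit=RANGE_SIZE_LIMIT):
    """Merge (offset, length) read ranges whose gaps are below hole_limit,
    without letting any merged range exceed size_limit."""
    out = []
    for offset, length in sorted(ranges):
        if out:
            prev_off, prev_len = out[-1]
            hole = offset - (prev_off + prev_len)
            merged_len = offset + length - prev_off
            if hole < hole_limit and merged_len <= size_limit:
                out[-1] = (prev_off, merged_len)  # coalesce into one request
                continue
        out.append((offset, length))
    return out

# Two chunks 4 KiB apart are merged into one request; a ~1 MiB gap
# (e.g. a column we skip) keeps the reads separate.
assert coalesce([(0, 100), (4196, 100)]) == [(0, 4296)]
assert coalesce([(0, 100), (1_048_676, 100)]) == [(0, 100), (1_048_676, 100)]
```

This also shows why the tradeoff with parallelism is usually small: skipped columns leave gaps larger than 8 KiB (so their reads stay separate and parallel), and a merged range can never grow past 64 MiB anyway.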
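As a toy illustration of the parallelism difference described above (not the actual reader code; the function names are hypothetical): fanning work out per column only, versus per (row group, column) pair, changes how many independent tasks the scheduler can run at once.

```python
def read_table_tasks(num_row_groups, columns):
    # ParquetFile/read_table style: one task per column; each task
    # decodes that column across all row groups sequentially.
    return [("column", col) for col in columns]

def dataset_tasks(num_row_groups, columns):
    # ParquetDataset style: one task per (row group, column),
    # i.e. one task per ColumnChunk.
    return [("chunk", rg, col)
            for rg in range(num_row_groups)
            for col in columns]

# 8 row groups x 3 columns: 3 units of parallel work vs 24.
assert len(read_table_tasks(8, ["a", "b", "c"])) == 3
assert len(dataset_tasks(8, ["a", "b", "c"])) == 24
```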
