lidavidm opened a new pull request #9482:
URL: https://github.com/apache/arrow/pull/9482
This exposes in Datasets the pre-buffering option that was implemented for the base
Parquet reader.
To summarize: based on the columns and row groups being read, the option
coalesces and buffers ranges of the file, combining adjacent and "nearby"
ranges so that a single read operation can serve multiple read requests. As a
result it handles both the case where the entire file is read and the case
where only a subset of columns and/or row groups is read.
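To illustrate the idea (not the actual Arrow implementation), here's a minimal sketch of range coalescing: byte ranges that are adjacent or separated by less than a hypothetical `hole_size` threshold are merged into single, larger read requests.

```python
def coalesce_ranges(ranges, hole_size=8192):
    """Merge (offset, length) ranges whose gaps are smaller than hole_size.

    Hypothetical sketch of the coalescing described above; names and the
    default threshold are illustrative, not Arrow's actual values.
    """
    if not ranges:
        return []
    merged = [sorted(ranges)[0]]
    for offset, length in sorted(ranges)[1:]:
        last_offset, last_length = merged[-1]
        last_end = last_offset + last_length
        if offset - last_end <= hole_size:
            # Gap is small enough: extend the previous request to cover this range.
            merged[-1] = (last_offset, max(last_end, offset + length) - last_offset)
        else:
            # Gap is large: issue a separate read request.
            merged.append((offset, length))
    return merged

# Two column chunks 100 bytes apart become one request; a distant one stays separate.
print(coalesce_ranges([(0, 1000), (1100, 500), (1_000_000, 200)]))
```

On high-latency filesystems like S3, issuing one larger request in place of several small ones is what drives the speedup shown below.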
There isn't a Datasets benchmark in the repo, but here's a quick comparison
loading data from S3:
```
Without buffering:
Data read: 1300.00 MiB
Mean : 7.68 s
Median : 7.48 s
Stdev : 0.76 s
Mean rate: 170.57 MiB/s

With buffering:
Data read: 1300.00 MiB
Mean : 3.88 s
Median : 3.89 s
Stdev : 0.21 s
Mean rate: 335.95 MiB/s
```
The code being benchmarked is essentially:
```python
import pyarrow.dataset

dataset = pyarrow.dataset.FileSystemDataset.from_paths(
    paths,
    schema=schema,
    format=format,
    filesystem=fs,
)
table = dataset.to_table()
```