lidavidm opened a new pull request #9482:
URL: https://github.com/apache/arrow/pull/9482


   This exposes, in Datasets, the pre-buffering option that was implemented 
for the base Parquet reader.
   
   To summarize, the option coalesces and buffers the byte ranges of a file 
that will actually be read, based on the columns and row groups requested: 
adjacent and "nearby" ranges are combined so that a single read operation 
serves multiple read requests. This means it helps both when the entire file 
is read and when only a subset of columns and/or row groups is read.
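   As a rough illustration of the coalescing idea, here is a pure-Python 
sketch (not the actual C++ implementation, which lives in Arrow's read-range 
cache and also caps the size of merged reads; the `hole_size_limit` threshold 
value below is made up):
   
   ```python
   def coalesce_ranges(ranges, hole_size_limit=8192):
       """Merge adjacent/nearby (offset, length) read ranges into larger reads.
   
       Sketch only: sort ranges by offset, then merge any two whose gap
       ("hole") is at most hole_size_limit bytes.
       """
       merged = []
       for offset, length in sorted(ranges):
           if merged:
               prev_offset, prev_length = merged[-1]
               prev_end = prev_offset + prev_length
               # Combine if the hole between ranges is small enough that
               # one larger read beats two round trips.
               if offset - prev_end <= hole_size_limit:
                   new_end = max(prev_end, offset + length)
                   merged[-1] = (prev_offset, new_end - prev_offset)
                   continue
           merged.append((offset, length))
       return merged
   
   # Two column chunks 1 KiB apart become one read; a distant one stays separate.
   print(coalesce_ranges([(0, 4096), (5120, 4096), (1_000_000, 4096)]))
   # → [(0, 9216), (1000000, 4096)]
   ```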
   
   There's no Datasets benchmark in the repo, but here's a quick comparison 
for loading data from S3:
   
   ```
   Without buffering:
   Data read: 1300.00 MiB
   Mean     : 7.68 s
   Median   : 7.48 s
   Stdev    : 0.76 s
   Mean rate: 170.57 MiB/s
   
   With buffering:
   Data read: 1300.00 MiB
   Mean     : 3.88 s
   Median   : 3.89 s
   Stdev    : 0.21 s
   Mean rate: 335.95 MiB/s
   ```
   
   The code being benchmarked is essentially:
   
   ```python
   import pyarrow.dataset

   # paths, schema, format, and fs are set up elsewhere in the benchmark
   dataset = pyarrow.dataset.FileSystemDataset.from_paths(
       paths,
       schema=schema,
       format=format,
       filesystem=fs,
   )
   table = dataset.to_table()
   ```
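   For reference, in current pyarrow releases the option is toggled through 
the Parquet scan options (the `pre_buffer` keyword and 
`ParquetFragmentScanOptions` name are taken from later pyarrow releases, so 
check the API of your version; a sketch):
   
   ```python
   import pyarrow.dataset as ds
   
   # Enable read coalescing/pre-buffering when scanning Parquet fragments.
   # (pre_buffer keyword assumed from pyarrow >= 4.0.)
   scan_opts = ds.ParquetFragmentScanOptions(pre_buffer=True)
   parquet_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_opts)
   
   # parquet_format would then be passed where the benchmark used `format`:
   # dataset = ds.FileSystemDataset.from_paths(
   #     paths, schema=schema, format=parquet_format, filesystem=fs)
   ```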



