marsupialtail opened a new pull request, #13799:
URL: https://github.com/apache/arrow/pull/13799

   This exposes the Fragment Readahead and Batch Readahead flags in the C++ 
Scanner to the user in Python.
   
   This can be used to fine-tune RAM usage and IO utilization when downloading 
large files from S3 or other network sources. I believe the default settings 
are overly conservative for machines with limited RAM, and I observe less than 
20% IO utilization on some AWS instances. 
   
   The Python API exposes these flags only on the methods where they make 
sense. Scanning from a RecordBatchIterator doesn't need them, and only the 
latter flag (batch readahead) makes sense when building a scanner from a 
single fragment. 
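   
   As a rough sketch of the intended surface (the `batch_readahead` keyword 
name here mirrors the C++ flag and is illustrative; only `fragment_readahead` 
appears verbatim in the benchmark below):
   
   ```
   import pyarrow.dataset as ds
   
   dataset = ds.dataset("s3://TPC", format="csv")
   
   # Dataset-level scan: both flags apply. fragment_readahead controls how
   # many fragments (files) are fetched concurrently; batch_readahead controls
   # how many batches are buffered ahead within a fragment.
   batches = dataset.to_batches(fragment_readahead=16, batch_readahead=16)
   
   # Scanner built from a single fragment: only batch_readahead applies,
   # since there is no set of fragments to read ahead over.
   fragment = next(iter(dataset.get_fragments()))
   scanner = ds.Scanner.from_fragment(fragment, batch_readahead=16)
   ```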
   
   To test this, set up an i3.2xlarge instance on AWS: 
   
   ```
   import pyarrow
   import pyarrow.dataset as ds
   import pyarrow.csv as csv
   import time
   
   pyarrow.set_cpu_count(8)
   pyarrow.set_io_thread_count(16)
   
   lineitem_scheme = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
                      "l_quantity", "l_extendedprice", "l_discount", "l_tax",
                      "l_returnflag", "l_linestatus", "l_shipdate",
                      "l_commitdate", "l_receiptdate", "l_shipinstruct",
                      "l_shipmode", "l_comment", "null"]
   csv_format = ds.CsvFileFormat(
       read_options=csv.ReadOptions(column_names=lineitem_scheme,
                                    block_size=32 * 1024 * 1024),
       parse_options=csv.ParseOptions(delimiter="|"))
   dataset = ds.dataset("s3://TPC", format=csv_format)
   
   # Pull 100 batches from the scan.
   s = dataset.to_batches(batch_size=1000000000)
   count = 0
   while count < 100:
       z = next(s)
       count += 1
   ```
   For our purposes, let's assume the TPC dataset consists of hundreds of 
Parquet files, each with one row group (something that Spark would generate). 
This script gets somewhere around 1 Gbps. If you now do
   ```
   s = dataset.to_batches(batch_size=1000000000, fragment_readahead=16)
   ```
   you can get to 2.5 Gbps, which is the advertised steady rate cap for this 
instance type.
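   
   To put a number on the throughput, one rough way is to time the loop and 
count decoded bytes (this measures in-memory decoded size rather than raw 
network bytes, so it only approximates the wire rate):
   
   ```
   import time
   
   s = dataset.to_batches(batch_size=1000000000, fragment_readahead=16)
   total_bytes = 0
   start = time.time()
   for _ in range(100):
       total_bytes += next(s).nbytes  # in-memory size of the decoded batch
   elapsed = time.time() - start
   print(f"~{total_bytes * 8 / elapsed / 1e9:.2f} Gbps decoded")
   ```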
   

