lidavidm commented on pull request #11535: URL: https://github.com/apache/arrow/pull/11535#issuecomment-951049087
I tested this with minio and toxiproxy set up with `toxiproxy-cli-linux-amd64 toxic add -n latency -t latency --attribute latency=100 s3`. Now, this is rather unrealistic - this is a lot more latency than you should expect from S3, unless you're doing a cross-region read - but it highlights the cost of I/O in this case.

Median times are given below. Three methods are compared: iterating through all record batches, iterating through all batches using the generator (which also uses coalescing), and using Datasets (async scanner) to read the data as a table.

```
Baseline:
Iterator: 5.54072s
Generator: 0.560195s
Datasets: 1.39329s

With the IPC message optimization:
Iterator: 2.95526s
Generator: 0.561748s
Datasets: 1.39662s

With the IPC message optimization and the footer optimization:
Iterator: 2.84875s
Generator: 0.456949s
Datasets: 1.08955s
```
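To make the comparison easier to read, the medians can be turned into speedup factors relative to the baseline. This is only a rough sketch: the numbers are copied from the results in this comment, while the dictionary keys and the `speedups` helper are illustrative names, not part of the benchmark code.

```python
# Median times in seconds, copied from the results reported in this comment.
# The configuration keys are hypothetical labels chosen for this sketch.
medians = {
    "baseline": {
        "Iterator": 5.54072, "Generator": 0.560195, "Datasets": 1.39329,
    },
    "ipc_message_opt": {
        "Iterator": 2.95526, "Generator": 0.561748, "Datasets": 1.39662,
    },
    "ipc_message_and_footer_opt": {
        "Iterator": 2.84875, "Generator": 0.456949, "Datasets": 1.08955,
    },
}


def speedups(results, reference="baseline"):
    """Return reference_time / time for every non-reference configuration."""
    ref = results[reference]
    return {
        config: {method: ref[method] / t for method, t in times.items()}
        for config, times in results.items()
        if config != reference
    }


table = speedups(medians)
for config, by_method in table.items():
    for method, factor in by_method.items():
        print(f"{config:28s} {method:10s} {factor:.2f}x")
```

With both optimizations, this works out to roughly a 1.9x improvement for the iterator and about 1.2x to 1.3x for the generator and Datasets. The IPC message optimization alone leaves the generator and Datasets times essentially unchanged.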
