[GitHub] [arrow] wesm commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

GitBox Tue, 15 Sep 2020 10:13:09 -0700


wesm commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-692852532



   In terms of benchmarking, one issue strikes me: on machines with many 
cores (e.g. 16- or 20-core servers), it may be faster to read a 2-file 
dataset (or even an n-file dataset, where n is less than the number of cores 
on the machine) by reading the files one at a time rather than through the 
datasets API. How many files do you need before the performance issue goes 
away? This would be good to quantify in a collection of benchmarks.
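
   One way the question above could be quantified is sketched below. This is 
only an illustrative harness, not part of the PR: the names `best_time`, 
`sweep`, and `crossover` are hypothetical, and the two reader callables are 
placeholders for "read the files one at a time" and "read through the datasets 
API". It times both strategies over growing file counts and reports the 
smallest count at which the dataset-style read is no slower.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence

# A reader takes a list of file paths and reads all of them (hypothetical type).
Reader = Callable[[Sequence[str]], object]


def best_time(read: Reader, paths: Sequence[str], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read(paths)
        best = min(best, time.perf_counter() - start)
    return best


def sweep(paths: Sequence[str], per_file: Reader, dataset: Reader,
          counts: Sequence[int], repeats: int = 3) -> List[Dict[str, float]]:
    """Time both strategies on the first n paths, for each n in `counts`."""
    results = []
    for n in counts:
        subset = list(paths[:n])
        results.append({
            "n_files": n,
            "per_file_s": best_time(per_file, subset, repeats),
            "dataset_s": best_time(dataset, subset, repeats),
        })
    return results


def crossover(results: Sequence[Dict[str, float]]) -> Optional[int]:
    """Smallest measured file count where the dataset read is no slower."""
    for row in sorted(results, key=lambda r: r["n_files"]):
        if row["dataset_s"] <= row["per_file_s"]:
            return int(row["n_files"])
    return None  # the dataset read never caught up in the measured range
```

   With pyarrow installed, `per_file` could be something like 
`lambda ps: [pq.read_table(p) for p in ps]` (with `pyarrow.parquet` imported as 
`pq`) and `dataset` something like 
`lambda ps: ds.dataset(list(ps), format="parquet").to_table()` (with 
`pyarrow.dataset` imported as `ds`). Running the sweep on machines with 
different core counts would show how the crossover point moves.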


