[ https://issues.apache.org/jira/browse/ARROW-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452704#comment-17452704 ]
Weston Pace commented on ARROW-14974:
-------------------------------------

Note: The inverse is often true. We will sometimes do some compute work on the I/O thread pool to avoid a potential loss of cache coherency or creating too many thread tasks. But this is more tolerable.

> [C++] Dataset scanning, in async mode, is running parquet reads on the CPU thread pool
> --------------------------------------------------------------------------------------
>
> Key: ARROW-14974
> URL: https://issues.apache.org/jira/browse/ARROW-14974
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> This is something I picked up while doing some profiling a while back. When
> running a scan of a large parquet dataset many of the read tasks (e.g. I/O
> reads) were running on the CPU thread pool. This could lead to the CPU
> thread pool being underutilized.
> It might not have a large effect on the parquet read itself (if the reads are
> slow we are probably I/O bound so one might not notice) but it can cause
> issues on a more complex query where reading is being interleaved with CPU
> work (like filtering and joining).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
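
For readers unfamiliar with the two-pool arrangement the issue refers to, here is a minimal sketch of the intended separation: blocking reads run on an I/O pool and only the compute step runs on a CPU pool. It uses Python's standard `concurrent.futures` rather than Arrow's C++ executors, and every name in it (`io_pool`, `cpu_pool`, `read_chunk`, `filter_chunk`, `scan`) is hypothetical and chosen for illustration. It is not how the Arrow scanner is implemented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools; the names and sizes are illustrative, not Arrow's.
io_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="io")
cpu_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="cpu")


def read_chunk(i):
    """Stand-in for a blocking read (e.g. one parquet column chunk)."""
    time.sleep(0.01)  # simulates waiting on disk or network
    return threading.current_thread().name, list(range(i, i + 4))


def filter_chunk(read_result):
    """Stand-in for CPU work such as filtering."""
    io_thread, values = read_result
    evens = [v for v in values if v % 2 == 0]
    return io_thread, threading.current_thread().name, evens


def scan(num_chunks):
    # Every blocking read is submitted to the I/O pool, so CPU threads
    # are never parked waiting on a read.
    reads = [io_pool.submit(read_chunk, i) for i in range(num_chunks)]
    # Completed reads are handed to the CPU pool for the compute step.
    work = [cpu_pool.submit(filter_chunk, f.result()) for f in reads]
    return [w.result() for w in work]


if __name__ == "__main__":
    for io_thread, cpu_thread, evens in scan(4):
        print(io_thread, cpu_thread, evens)
```

If `read_chunk` were submitted to `cpu_pool` instead, its two workers would spend most of their time asleep in the simulated read, which is the underutilization the issue describes.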