[jira] [Commented] (ARROW-14974) [C++] Dataset scanning, in async mode, is running parquet reads on the CPU thread pool

Weston Pace (Jira) Thu, 02 Dec 2021 18:48:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452704#comment-17452704
 ]


Weston Pace commented on ARROW-14974:
-------------------------------------

Note: The inverse is often true.  We will sometimes run compute work on the 
I/O thread pool, either to avoid a potential loss of cache locality or to 
avoid creating too many thread tasks.  That is more tolerable, though.

> [C++] Dataset scanning, in async mode, is running parquet reads on the CPU 
> thread pool
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-14974
>                 URL: https://issues.apache.org/jira/browse/ARROW-14974
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> This is something I picked up while doing some profiling a while back.  When 
> scanning a large parquet dataset, many of the read tasks (e.g. I/O reads) 
> were running on the CPU thread pool.  A CPU thread that is blocked on I/O 
> cannot do compute work, so this could leave the CPU thread pool 
> underutilized.
> It might not have a large effect on the parquet read itself (if the reads are 
> slow we are probably I/O bound, so one might not notice), but it can cause 
> issues in a more complex query where reading is interleaved with CPU work 
> (like filtering and joining).
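
To illustrate the separation the issue asks for, here is a minimal sketch using only the C++ standard library.  It is not Arrow's implementation or API: SimplePool, BlockingRead, Decode and ScanFragments are hypothetical names, and the "read" is just a sleep.  The point is only that blocking reads are submitted to one pool and compute work to another, so no CPU-pool thread sits idle inside a read.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal thread pool; a stand-in, not Arrow's thread pool.
class SimplePool {
 public:
  explicit SimplePool(int num_threads) {
    assert(num_threads > 0);
    for (int i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) worker.join();
  }

  // Queue a task and return a future for its result.
  template <typename Fn>
  auto Submit(Fn fn) -> std::future<decltype(fn())> {
    using Result = decltype(fn());
    auto task = std::make_shared<std::packaged_task<Result()>>(std::move(fn));
    std::future<Result> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return result;
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopping and the queue is drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stopping_ = false;
};

// Simulated blocking read: the thread sleeps as it would while waiting on I/O.
std::vector<int> BlockingRead(int fragment) {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  return {fragment, fragment + 1, fragment + 2};
}

// Simulated CPU work (standing in for decode/filter): sums the values.
int Decode(const std::vector<int>& rows) {
  int sum = 0;
  for (int value : rows) sum += value;
  return sum;
}

// Reads are submitted to io_pool and decoding to cpu_pool, so no cpu_pool
// thread ever blocks inside BlockingRead.
int ScanFragments(SimplePool& io_pool, SimplePool& cpu_pool, int num_fragments) {
  std::vector<std::future<std::vector<int>>> reads;
  for (int i = 0; i < num_fragments; ++i) {
    reads.push_back(io_pool.Submit([i] { return BlockingRead(i); }));
  }
  std::vector<std::future<int>> decoded;
  for (auto& read : reads) {
    std::vector<int> rows = read.get();  // only the calling thread waits here
    decoded.push_back(
        cpu_pool.Submit([rows = std::move(rows)] { return Decode(rows); }));
  }
  int total = 0;
  for (auto& result : decoded) total += result.get();
  return total;
}
```

A caller would construct two pools, for example `SimplePool io_pool(4); SimplePool cpu_pool(2);`, and call `ScanFragments(io_pool, cpu_pool, 4)`.  In this sketch the calling thread blocks on each read; an asynchronous scanner would more likely attach the decode step as a continuation on the future, but the routing of tasks to pools is the same idea.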



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
