[ 
https://issues.apache.org/jira/browse/ARROW-14663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442290#comment-17442290
 ] 

Jonathan Keane commented on ARROW-14663:
----------------------------------------

In my experience, CSV reading already does use the global thread pool:

{code:r}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

set_cpu_count(12)
system.time({read_csv_arrow("nyctaxi_2010-01.csv", as_data_frame = FALSE)})
#>    user  system elapsed 
#>  20.328   6.747   3.292

set_cpu_count(1)
system.time({read_csv_arrow("nyctaxi_2010-01.csv", as_data_frame = FALSE)})
#>    user  system elapsed 
#>  11.490   1.236  12.788
{code}

So that's good. I agree with [~npr] that exposing {{num_threads}} just for CSV 
would be a bit odd. Arrow does many things, and people most likely want to set 
the number of threads globally, so keeping the existing helper and advertising 
it better is probably the right answer.
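
For anyone who does want per-call control in the meantime, here is a minimal 
sketch of a wrapper around the global helpers. {{with_cpu_count()}} is a 
hypothetical name, not an arrow function; it only assumes that {{cpu_count()}} 
and {{set_cpu_count()}} behave as in the timings above.

{code:r}
# Hypothetical helper (not part of arrow): evaluate `expr` with the global
# CPU thread pool temporarily set to `n` threads, then restore the old size.
with_cpu_count <- function(n, expr) {
  old <- arrow::cpu_count()
  # Restore the previous pool size even if `expr` throws an error
  on.exit(arrow::set_cpu_count(old), add = TRUE)
  arrow::set_cpu_count(n)
  # `expr` is a lazily evaluated promise, so it only runs here,
  # after the pool has been resized
  force(expr)
}

# Usage, assuming the same file as in the timings above:
# tab <- with_cpu_count(
#   4,
#   arrow::read_csv_arrow("nyctaxi_2010-01.csv", as_data_frame = FALSE)
# )
{code}

Because the pool is global, this changes the thread count for everything arrow 
does while {{expr}} runs, not just for the CSV reader.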

Something that *is* confusing here is that we have 
{{cpu_count()}}/{{set_cpu_count()}}, 
{{io_thread_count()}}/{{set_io_thread_count()}}, and 
{{options("arrow.use_threads")}}, and there isn't a good explanation of the 
difference between them. (I thought we had made a Jira about explaining this 
difference, even if it is only/mostly developer-focused.)
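
To see the three settings side by side, something like the following could be 
used. {{arrow_thread_settings()}} is a hypothetical helper, not an arrow 
function, and the comments reflect my understanding of each setting rather than 
anything the documentation currently spells out.

{code:r}
# Hypothetical helper (not part of arrow): collect the three thread-related
# settings mentioned above into a single list for inspection.
arrow_thread_settings <- function() {
  list(
    # Number of threads in the global CPU thread pool
    cpu_count = arrow::cpu_count(),
    # Number of threads in the global I/O thread pool
    io_thread_count = arrow::io_thread_count(),
    # R-level option; NULL when it has not been set in this session
    use_threads = getOption("arrow.use_threads")
  )
}

# Usage:
# str(arrow_thread_settings())
{code}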

We should probably add details to the {{read_csv_arrow()}} documentation 
linking to {{cpu_count()}}/{{set_cpu_count()}} as the way to control 
multithreading, and possibly refine the documentation for the others (or mark 
them with {{@keywords internal}} if we don't expect users to need or want to 
set them).

> [R] Expose number of threads in read_csv_arrow() and write_csv_arrow()
> ----------------------------------------------------------------------
>
>                 Key: ARROW-14663
>                 URL: https://issues.apache.org/jira/browse/ARROW-14663
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Minor
>
> As of {{readr}} 2.0.0 (and the switch to {{vroom}}) both {{read_csv()}} and 
> {{write_csv()}} allow the user to pass the number of threads to be used when 
> processing (the {{num_threads}} argument). Currently this functionality is 
> not exposed in Arrow. Some functionality (not yet the CSV read or write) 
> allows the user to use the global CPU thread pool, but {{num_threads}} would 
> offer more granular control. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
