[ 
https://issues.apache.org/jira/browse/ARROW-14663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442880#comment-17442880
 ] 

Weston Pace commented on ARROW-14663:
-------------------------------------

I'll point out that readr's description of "num_threads" is:

{quote}The number of processing threads to use for initial parsing and lazy 
reading of data. If your data contains newlines within fields the parser should 
automatically detect this and fall back to using one thread only. However if 
you know your file has newlines within quoted fields it is safest to set 
{{num_threads = 1}} explicitly.{quote}

This comment does not apply to Arrow's CSV reader (fields aren't allowed to 
contain newlines at all if I recall).

Also, an argument in this spot would seem to suggest that Arrow is going to 
spin up "num_threads" _new_ threads specific to the task.  That is not how 
Arrow works.  Arrow has a single CPU thread pool that all tasks share.  
So if you call read_csv multiple times in parallel, no additional threads are 
created; all of the calls share the same threads.
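
To make the shared-pool point concrete, here is a rough sketch using only 
Python's standard library.  It is not Arrow's code or API: {{SHARED_POOL}}, 
{{POOL_SIZE}}, {{fake_read_csv}} and {{run_concurrent_reads}} are hypothetical 
names made up for illustration, and the "parsing" is just a short sleep.  It 
only shows that several concurrent "reads" submitting work to one 
process-wide pool never use more worker threads than the pool holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a process-wide pool (assumption: not Arrow's real pool).
POOL_SIZE = 4
SHARED_POOL = ThreadPoolExecutor(max_workers=POOL_SIZE)


def fake_read_csv(name, n_chunks=8):
    """Pretend to parse a file as chunks scheduled on the shared pool.

    Returns the set of worker thread ids that processed the chunks.
    """
    def parse_chunk(i):
        time.sleep(0.01)  # simulate parsing work
        return threading.get_ident()

    futures = [SHARED_POOL.submit(parse_chunk, i) for i in range(n_chunks)]
    return {f.result() for f in futures}


def run_concurrent_reads(n_callers=3):
    """Call fake_read_csv from several caller threads at the same time.

    Returns every pool worker id seen across all of the calls.
    """
    seen = set()
    lock = threading.Lock()

    def caller(k):
        ids = fake_read_csv(f"file{k}.csv")
        with lock:
            seen.update(ids)

    callers = [threading.Thread(target=caller, args=(k,)) for k in range(n_callers)]
    for t in callers:
        t.start()
    for t in callers:
        t.join()
    return seen


worker_ids = run_concurrent_reads()
# The number of distinct workers is bounded by the pool, not by the number of calls.
print(len(worker_ids) <= POOL_SIZE)
```

In this toy model a per-call "num_threads" would have nothing to size, since 
no call owns any threads; the only real knob is the size of the shared pool.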

So I'm also in favor of not supporting a "num_threads" argument.

 

{quote}

Maybe we can allow the user to specify {{write_options}} or {{readr_options}} 
and that will allow them some level of control over the use of the global 
thread pool.

{quote}

This is doable but not trivial.  Without a concrete use case I would be very 
reluctant to investigate further.

> [R] Expose number of threads in read_csv_arrow() and write_csv_arrow()
> ----------------------------------------------------------------------
>
>                 Key: ARROW-14663
>                 URL: https://issues.apache.org/jira/browse/ARROW-14663
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Minor
>
> As of {{readr}} 2.0.0 (and the switch to {{vroom}}) both {{read_csv()}} and 
> {{write_csv()}} allow the user to pass the number of threads to be used when 
> processing (the {{num_threads}} argument). Currently this functionality is 
> not exposed in Arrow. Some functionality (not yet the CSV read or write) 
> allows the user to use the global CPU thread pool, but {{num_threads}} would 
> offer more granular control. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
