[
https://issues.apache.org/jira/browse/ARROW-14663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442880#comment-17442880
]
Weston Pace commented on ARROW-14663:
-------------------------------------
I'll point out that the description for "num_threads" is:
{quote}The number of processing threads to use for initial parsing and lazy
reading of data. If your data contains newlines within fields the parser should
automatically detect this and fall back to using one thread only. However if
you know your file has newlines within quoted fields it is safest to set
{{num_threads = 1}} explicitly.{quote}
This comment does not apply to Arrow's CSV reader (fields aren't allowed to
contain newlines at all if I recall).
Also, an argument in this spot would seem to suggest that Arrow is going to
spin up "num_threads" _new_ threads specific to the task. That is not how
Arrow works. Arrow has a single thread pool that all tasks share. So if you
call read_csv multiple times in parallel, no additional threads are created;
all of the calls share the same threads.
So I'm also in favor of not supporting a "num_threads" argument.
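To make the shared-pool point concrete, here is a minimal sketch in plain Python. It is not Arrow code: {{SHARED_POOL}}, {{POOL_SIZE}}, {{fake_read_csv}} and {{run_concurrent_reads}} are hypothetical names, and a standard-library {{ThreadPoolExecutor}} stands in for Arrow's global thread pool. It only illustrates that several concurrent "reads" never use more worker threads than the one pool holds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool size; stands in for the size of a global thread pool.
POOL_SIZE = 4
# One process-wide pool that every task shares (illustration only, not Arrow's pool).
SHARED_POOL = ThreadPoolExecutor(max_workers=POOL_SIZE, thread_name_prefix="shared")


def fake_read_csv(name, n_chunks=8):
    """Hypothetical stand-in for a CSV read.

    Splits the work into chunks and submits each chunk to the shared pool
    instead of starting threads of its own. Returns the names of the worker
    threads that processed the chunks.
    """
    def parse_chunk(_index):
        return threading.current_thread().name

    futures = [SHARED_POOL.submit(parse_chunk, i) for i in range(n_chunks)]
    return {f.result() for f in futures}


def run_concurrent_reads(n_reads=3):
    """Start several reads at the same time and collect every worker name used."""
    seen = set()
    lock = threading.Lock()

    def caller(k):
        names = fake_read_csv(f"file{k}.csv")
        with lock:
            seen.update(names)

    callers = [threading.Thread(target=caller, args=(k,)) for k in range(n_reads)]
    for t in callers:
        t.start()
    for t in callers:
        t.join()
    return seen


workers = run_concurrent_reads()
# However many reads run in parallel, they draw on at most POOL_SIZE workers.
print(len(workers) <= POOL_SIZE)  # prints True
```

In this sketch the only knob is the size of the shared pool, which is why a per-call {{num_threads}} argument does not map onto the design. As far as I know the R package already exposes that global knob through {{cpu_count()}} and {{set_cpu_count()}}.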
{quote}
Maybe we can allow the user to specify {{write_options}} or {{readr_options}}
and that will allow them some level of control over the use of the global
thread pool.
{quote}
This is doable but not trivial. Without a concrete use case I would be very
reluctant to investigate further.
> [R] Expose number of threads in read_csv_arrow() and write_csv_arrow()
> ----------------------------------------------------------------------
>
> Key: ARROW-14663
> URL: https://issues.apache.org/jira/browse/ARROW-14663
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Minor
>
> As of {{readr}} 2.0.0 (and the switch to {{vroom}}) both {{read_csv()}} and
> {{write_csv()}} allow the user to pass the number of threads to be used when
> processing (the {{num_threads}} argument). Currently this functionality is
> not exposed in Arrow. Some functionality (not yet the CSV read or write)
> allows the user to use the global CPU thread pool, but {{num_threads}} would
> offer more granular control.