[jira] [Commented] (ARROW-13317) [Python] Improve documentation on what 'use_threads' does in 'read_feather'

Weston Pace (Jira) Mon, 12 Jul 2021 13:39:18 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379405#comment-17379405
 ]


Weston Pace commented on ARROW-13317:
-------------------------------------

The RecordBatchFileReader (which the feather reader uses behind the scenes) 
has a use_threads option that should control this.  Is read_feather simply 
being kept alive for backwards compatibility (in which case we should not make 
it more configurable and should probably mark it deprecated), or is it going 
to be maintained as a separate API and a simpler frontend to 
RecordBatchFileReader?  (I think I'll actually raise this question on the 
mailing list.)

Also, now that I look at it, RecordBatchFileReader in Python doesn't expose 
IpcReadOptions at all, so a Python change would be needed to expose this 
option too.

I don't know about mentioning set_cpu_count.  It does solve the problem, but 
it's more of a "global" setting: it also affects how many files are read at 
once by dataset scans, Parquet parallelism, and even compute-level parallelism 
(once that has more support).  We probably don't want to reference it 
everywhere it has an effect.

> [Python] Improve documentation on what 'use_threads' does in 'read_feather'
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-13317
>                 URL: https://issues.apache.org/jira/browse/ARROW-13317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 4.0.1
>            Reporter: Arun Joseph
>            Priority: Trivial
>              Labels: documentation
>
> The current documentation for 
> [read_feather|https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html]
>  states the following:
> *use_threads* (_bool__,_ _default True_) – Whether to parallelize reading 
> using multiple threads.
> If the underlying file uses compression, multiple threads can still be 
> spawned. The wording of *use_threads* is ambiguous about whether the 
> restriction on multiple threads applies only to the conversion from pyarrow 
> to the pandas DataFrame or also to the reading/decompression of the file 
> itself, which might spawn additional threads.
> [set_cpu_count|http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count]
>  might be good to mention as a way to actually limit the threads spawned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
