[jira] [Commented] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

Joris Van den Bossche (Jira) Tue, 09 Nov 2021 07:14:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441215#comment-17441215
 ]


Joris Van den Bossche commented on ARROW-14621:
-----------------------------------------------

Proposal: coerce arguments that accept a list-like to list (`list(...)`) where 
needed (so where not just iterating through the object). 

That seems more generally useful and cleaner (compared to eg explicitly 
starting to check for list as argument type)

> [Python] read_feather's "columns" argument claims to support any iterable but 
> does not accept pandas series
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14621
>                 URL: https://issues.apache.org/jira/browse/ARROW-14621
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Weston Pace
>            Priority: Major
>
> From https://github.com/apache/arrow/issues/11500:
> {quote}
> According to the read_feather documentation it accepts any sequence type ( 
> https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html
>  ) so not strictly lists, but also tuples etc... As far as they contain int 
> or str.
> While a pandas.Index adheres to the sequence protocol it provides custom 
> implementation of many methods you would expect in a container. I personally 
> think that supporting all possible implementations of classes that adheres to 
> sequence protocol even when they are heavily custom is probably out of scope, 
> so I would personally change the docstring to mention list|tuple and avoid 
> confusion.
> Implementing support for pandas.Index is fairly quick but that would probably 
> end up casting the index to a tuple or list if we want to keep the code 
> generic. And that would mean introducing an extra cost that the user doesn't 
> know it's paying when invoking that method. I guess it's more reasonable to 
> make that cost explicit and have user cast to list or tuple
> {quote}
> [~amol-]
> {quote}
> Technically the sequence protocol does not define equality. The problem seems 
> to originate from the line sorted(set(columns)) == columns. We are relying on 
> list == sequence => bool which is not valid when the sequence is a numpy 
> array (list == np.ndarray => np.ndarray).
> The correct method for comparing sequences seems to be converting both sides 
> to list or using imap (although given we are already doing 
> sorted(set(columns)) I think imap would be overkill).
> I'm in favor of @amol- 's general point though. Changing the docs to mention 
> list|tuple is probably a good solution.
> {quote}
> [~westonpace]
> {quote}
> On the other hand, supporting numpy arrays for those kind of arguments that 
> expect a list is also relatively easy. We have quite some other methods that 
> document a certain argument as list, while still accepting a numpy array 
> (because in most cases we just rely on 1) length check and 2) iteration 
> through the values).
> I think the most important would be to have consistency. But if we have some 
> APIs that are currently strict on requiring a list, and others loose on 
> requiring any iterable, then there are two options: make everything strict, 
> or fix those cases that are now strict to have them work with any iterable.
> But to me it would also feel strange to start adding strict isinstance(.., 
> list) checks in several places that currently are perfectly fine with 
> accepting a numpy array or any iterable ...
> {quote}
> [~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

Reply via email to