[ https://issues.apache.org/jira/browse/ARROW-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441215#comment-17441215 ]
Joris Van den Bossche commented on ARROW-14621: ----------------------------------------------- Proposal: coerce arguments that accept a list-like to list (`list(...)`) where needed (so where not just iterating through the object). That seems more generally useful and cleaner (compared to eg explicitly starting to check for list as argument type) > [Python] read_feather's "columns" argument claims to support any iterable but > does not accept pandas series > ----------------------------------------------------------------------------------------------------------- > > Key: ARROW-14621 > URL: https://issues.apache.org/jira/browse/ARROW-14621 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Weston Pace > Priority: Major > > From https://github.com/apache/arrow/issues/11500: > {quote} > According to the read_feather documentation it accepts any sequence type ( > https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html > ) so not strictly lists, but also tuples etc... As far as they contain int > or str. > While a pandas.Index adheres to the sequence protocol it provides custom > implementation of many methods you would expect in a container. I personally > think that supporting all possible implementations of classes that adheres to > sequence protocol even when they are heavily custom is probably out of scope, > so I would personally change the docstring to mention list|tuple and avoid > confusion. > Implementing support for pandas.Index is fairly quick but that would probably > end up casting the index to a tuple or list if we want to keep the code > generic. And that would mean introducing an extra cost that the user doesn't > know it's paying when invoking that method. I guess it's more reasonable to > make that cost explicit and have user cast to list or tuple > {quote} > [~amol-] > {quote} > Technically the sequence protocol does not define equality. The problem seems > to originate from the line sorted(set(columns)) == columns. We are relying on > list == sequence => bool which is not valid when the sequence is a numpy > array (list == np.ndarray => np.ndarray). > The correct method for comparing sequences seems to be converting both sides > to list or using imap (although given we are already doing > sorted(set(columns)) I think imap would be overkill). > I'm in favor of @amol- 's general point though. Changing the docs to mention > list|tuple is probably a good solution. > {quote} > [~westonpace] > {quote} > On the other hand, supporting numpy arrays for those kind of arguments that > expect a list is also relatively easy. We have quite some other methods that > document a certain argument as list, while still accepting a numpy array > (because in most cases we just rely on 1) length check and 2) iteration > through the values). > I think the most important would be to have consistency. But if we have some > APIs that are currently strict on requiring a list, and others loose on > requiring any iterable, then there are two options: make everything strict, > or fix those cases that are now strict to have them work with any iterable. > But to me it would also feel strange to start adding strict isinstance(.., > list) checks in several places that currently are perfectly fine with > accepting a numpy array or any iterable ... > {quote} > [~jorisvandenbossche] -- This message was sent by Atlassian Jira (v8.20.1#820001)