[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438721#comment-16438721 ]
ASF GitHub Bot commented on ARROW-2101:
---------------------------------------

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381410530

@pitrou, on second look it won't be more efficient to move the check to outside of AppendObjectStrings. When passing check_valid to AppendObjectStrings, the UTF-8 decoding/check only happens if the data is Python 3 bytes or Python 2 strings. However, if the user passes Python 3 strings or Python 2 unicode and wants a string type, no extra checks are done. In the case where the user wants the output type to be an arrow string, then we need to do the check on each bytes object. Otherwise, we will return a StringArray that has data that's not actually UTF-8.

Please let me know if that makes sense, and if not, let me know how you would make it faster.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>   Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>
> Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow
> data of binary type, even if the user supplies type information. conversion
> of 'unicode' type works to create Arrow data of string types.
> For example
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
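
The per-object check described in the comment above could be sketched in pure Python 3 roughly as follows. This is only an illustration under stated assumptions, not the Arrow C++ code: the function name collect_utf8_payloads and its check_valid parameter are hypothetical stand-ins for the behaviour attributed to AppendObjectStrings. Bytes objects are validated as UTF-8 only when a string type is requested, while text objects need no extra check.

```python
def collect_utf8_payloads(values, check_valid):
    """Hypothetical pure-Python analogue of the check discussed above.

    Returns the raw byte payloads that would back a string array.
    None entries are kept as nulls.
    """
    payloads = []
    for value in values:
        if value is None:
            payloads.append(None)
        elif isinstance(value, bytes):
            if check_valid:
                # Only bytes need validating: decoding raises
                # UnicodeDecodeError when the payload is not UTF-8.
                value.decode("utf-8")
            payloads.append(value)
        elif isinstance(value, str):
            # Text is encoded here, so the result is UTF-8 with no
            # separate validation pass.
            payloads.append(value.encode("utf-8"))
        else:
            raise TypeError("expected bytes, str or None, got %r" % type(value))
    return payloads
```

With check_valid=False the bytes are passed through untouched, which corresponds to the binary output case; with check_valid=True an invalid payload such as b"\xff" is rejected instead of ending up in an array labelled as string.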