Jochen Ott created ARROW-374:
--------------------------------
Summary: Python: clarify unicode vs. binary in API
Key: ARROW-374
URL: https://issues.apache.org/jira/browse/ARROW-374
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.1.0
Reporter: Jochen Ott
Priority: Minor
pyarrow supports arrow's String type, which arrow represents internally as
BINARY with a UTF8 annotation.
In python 2, the pyarrow API accepts both {{unicode}} and binary strings
({{str}}), where the latter are assumed to be utf-8 encoded. I find this
approach problematic, because:
* there is an implicit assumption that a binary {{str}} contains valid utf-8
data. This assumption can be wrong, however, and it's not clear what the
consequences of passing such "invalid data" to the API are.
* the utf-8 assumption is not clearly documented or otherwise visible from
the API
* if pyarrow wants to support pure binary data in the future, a natural choice
would be to use {{str}} as the python2 type. However, this would conflict with
the current interpretation of binary {{str}} as BINARY+UTF8
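To illustrate the first point, here is a minimal sketch in plain Python 3, where {{bytes}} plays the role of python2's binary {{str}}. It does not call pyarrow at all; {{is_valid_utf8}} is a hypothetical helper used only to show that a binary string is not necessarily valid utf-8.

```python
# Python 3 sketch; bytes stands in for python2's binary str.
# is_valid_utf8 is a hypothetical helper, not part of pyarrow.

def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways: only one of them is valid UTF-8.
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

Both values are ordinary binary strings from the caller's point of view, so an API that silently assumes utf-8 cannot tell them apart without validating.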
*Proposed solution*
I propose to change the API so that it only accepts or returns unicode strings,
i.e. python2's {{unicode}} and python3's {{str}}. Passing a python2 {{str}}
should raise an exception, same for python3's {{bytes}}.
If raw BINARY is also supported at some point in the future, use python3's
{{bytes}} and python2's {{str}} for it.
As a convenience feature for API users, the API may also allow passing utf-8
encoded binary data as arrow's String, but that should be an explicit, opt-in
choice, so that API users are aware of the encoding assumptions made.
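A rough sketch of the proposed behaviour, written for Python 3 and independent of pyarrow. The function name {{to_string_value}} and the {{assume_utf8}} flag are assumptions made up for illustration; they are not existing pyarrow API.

```python
# Hypothetical sketch of the proposed type check; not pyarrow code.

def to_string_value(value, assume_utf8=False):
    """Return the text to store in a String column.

    Unicode strings are accepted as-is. Binary data is rejected unless the
    caller explicitly opts in with assume_utf8=True.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        if not assume_utf8:
            raise TypeError(
                "binary data passed where a unicode string was expected; "
                "pass assume_utf8=True to decode it as UTF-8"
            )
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        return value.decode("utf-8")
    raise TypeError("expected a unicode string, got %s" % type(value).__name__)
```

With this shape, {{to_string_value(b"abc")}} raises {{TypeError}}, while {{to_string_value(b"abc", assume_utf8=True)}} returns the decoded text, so the encoding assumption is visible at the call site.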