Diego Argueta created ARROW-4883:
------------------------------------
Summary: [Python] read_csv() gives mojibake if given file object
in text mode
Key: ARROW-4883
URL: https://issues.apache.org/jira/browse/ARROW-4883
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.1
Environment: Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.6 (High Sierra)
Reporter: Diego Argueta
h1. Summary:
Python 3:
* {{read_csv}} returns mojibake if given file objects opened in text mode. It
behaves as expected in binary mode.
* Files encoded in anything other than valid UTF-8 will cause a crash.
Python 2:
* {{read_csv}} only handles ASCII files. If given a file in UTF-8 with characters
over U+007F, it crashes.
h1. To reproduce:
1) Create a CSV file like this:
{code}
Header
123.45
{code}
2) Then run this code on Python 3:
{code:python}
>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string
{code}
Notice that the file object is opened in text mode. Changing the encoding doesn't
help:
{code:python}
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string
{code}
If I open the file in binary mode, it works:
{code:python}
>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double
{code}
I tried this with a file encoded in UTF-16, and it raised a {{UnicodeDecodeError}}
when the REPL tried to print the resulting table:
{code}
Traceback (most recent call last):
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
    return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
{code}
Presumably this is because the code always assumes the file is in UTF-8.
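The {{0xff}} at position 0 is consistent with a UTF-16 byte-order mark reaching a UTF-8 decoder. A small stdlib-only illustration follows; it assumes the test file started with a little-endian BOM, which the report does not actually show:

```python
import codecs

# Sample CSV encoded as UTF-16 little-endian with an explicit BOM.
# The BOM and byte order are assumptions; the original file's bytes are unknown.
data = codecs.BOM_UTF16_LE + u'Header\n123.45\n'.encode('utf-16-le')

try:
    data.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure is reported at the very first byte, the BOM's 0xff.
print(error)
```

If this is what happened, the column name itself would carry the BOM bytes, which would explain why the error only surfaces once the schema is converted to a string.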
h2. Python 2 behavior
Python 2 behaves differently: it uses the ASCII codec by default, so when
handed a file encoded in UTF-8, {{read_csv}} returns without an error. Accessing
the table afterwards fails:
{code}
>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
>>> list(t)
Traceback (most recent call last):
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
    result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
{code}
h1. Expectation
We should be able to hand read_csv() a file in text mode so that the CSV file
can be in any text encoding.
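One way this could look, as a rough sketch only: check whether the handle is in text mode and, if so, let Python do the decoding and pass UTF-8 bytes onward. The helper name {{to_utf8_stream}} is hypothetical and this is not how PyArrow is implemented; it also reads the whole file into memory, which a real fix would probably want to avoid.

```python
import io


def to_utf8_stream(f):
    """Return a binary stream of UTF-8 bytes for a text- or binary-mode file object.

    Hypothetical helper for illustration; not part of PyArrow.
    """
    if isinstance(f, io.TextIOBase):
        # Text mode: Python decodes using the handle's own encoding,
        # so read the text and re-encode it as UTF-8.
        return io.BytesIO(f.read().encode('utf-8'))
    # Binary mode: pass through unchanged.
    return f


# A text-mode handle over UTF-16 data and a binary handle over the same CSV.
text_handle = io.TextIOWrapper(
    io.BytesIO(u'Header\n123.45\n'.encode('utf-16')),
    encoding='utf-16', newline='')
binary_handle = io.BytesIO(b'Header\n123.45\n')

from_text = to_utf8_stream(text_handle).read()
from_binary = to_utf8_stream(binary_handle).read()
```

Both handles end up yielding the same UTF-8 bytes, so the parser would only ever need to deal with one encoding.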