[ https://issues.apache.org/jira/browse/ARROW-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs updated ARROW-10498:
------------------------------------
    Affects Version/s: 1.0.1

> pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10498
>                 URL: https://issues.apache.org/jira/browse/ARROW-10498
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1, 2.0.0
>         Environment: > python
> Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
>            Reporter: Jason Sachs
>            Priority: Critical
>
> The pyarrow.Table.from_* methods appear to cut off binary data after an embedded zero byte.
> {code}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ...                  b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>                   0
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> {code}
> Another test case (perhaps it's in the pyarrow.Table -> to_pandas() conversion step?):
> {code}
> import numpy as np
> import pyarrow as pa
> import pandas as pd
> print('PyArrow version: %s' % pa.__version__)
> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
>                  b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> df1 = pd.DataFrame(data, columns=['data'])
> print('\ndf1:\n', df1)
> pqfile = '10498.pq'
> df1.to_parquet(pqfile)
>
> tables = {'from_pydict': pa.Table.from_pydict({'data':data}),
>           'from_arrays': pa.Table.from_arrays([data],['data']),
>           'from_pandas': pa.Table.from_pandas(df1),
>           'read_table': pa.parquet.read_table(pqfile)
>          }
> for k,v in tables.items():
>     print("\ntables['%s'].to_pandas():\n" % k,
>           v.to_pandas())
>
> print('Pandas from parquet file:\n', pd.read_parquet(pqfile))
> for k,v in tables.items():
>     print("tables['%s']['data'][6]=%s" % (k,v['data'][6]))
> {code}
> which prints on my machine
> {noformat}
> >python arrow10498.py
> PyArrow version: 2.0.0
> df1:
>                data
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> tables['from_pydict'].to_pandas():
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_arrays'].to_pandas():
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_pandas'].to_pandas():
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['read_table'].to_pandas():
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> Pandas from parquet file:
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> tables['from_pydict']['data'][6]=b'half'
> tables['from_arrays']['data'][6]=b'half'
> tables['from_pandas']['data'][6]=b'half'
> tables['read_table']['data'][6]=b'half'
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
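
The truncated values in the report (b'\x00Baz!' becoming b'' and b'half\x00baked' becoming b'half') match what one would see if a fixed-width, zero-padded field were cut at its first zero byte instead of having only its trailing padding removed. The sketch below illustrates that difference using the standard library only. It is an assumption about the mechanism, not a confirmed diagnosis, and the helper names `cut_at_first_nul` and `strip_trailing_padding` are hypothetical and not part of pyarrow or NumPy.

```python
# Illustrative only: models two ways a reader of fixed-width, NUL-padded
# bytes (the layout of a '|S13' field) could recover the original value.
# Neither function is pyarrow or NumPy API; both names are hypothetical.

WIDTH = 13  # field width taken from dtype='|S13' in the report


def cut_at_first_nul(raw: bytes) -> bytes:
    """C-string style: keep everything before the first zero byte."""
    return raw.split(b"\x00", 1)[0]


def strip_trailing_padding(raw: bytes) -> bytes:
    """Remove only the zero bytes that serve as right padding."""
    return raw.rstrip(b"\x00")


if __name__ == "__main__":
    for value in [b"Foo!!", b"\x00Baz!", b"half\x00baked"]:
        padded = value.ljust(WIDTH, b"\x00")  # emulate the fixed-width storage
        print(value, cut_at_first_nul(padded), strip_trailing_padding(padded))
```

Cutting at the first zero byte reproduces the values shown in the pyarrow output above, while stripping only the trailing padding reproduces the values shown by `pd.DataFrame(data)`.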