[ https://issues.apache.org/jira/browse/ARROW-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099952#comment-17099952 ]
Francois Saint-Jacques commented on ARROW-8677:
-----------------------------------------------

If you also have the producing Rust code, it would help to verify whether the problem is in the writer or the reader.
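For context, here is a minimal sketch of what such a producing program typically looks like. This is an editorial reconstruction, not the reporter's actual code: it writes the schema quoted in the report below, one row group per batch, with placeholder values, against the Rust `parquet` crate's old-style `SerializedFileWriter` API. Exact types are version-sensitive (0.17-era releases passed the schema as `Rc` rather than `Arc`, and the `FileWriter` trait was removed in later releases); `BATCH_SIZE`, the file name, and the data values are taken from or invented around the report.

```rust
// Editorial reconstruction, NOT the reporter's actual program: a typical
// write_batch loop for the schema quoted in the report below, one row
// group per batch, with placeholder values.
use std::{fs::File, path::Path, sync::Arc};

use parquet::{
    column::writer::ColumnWriter,
    data_type::ByteArray,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

// 10000 and 1 reportedly produce unreadable files; 1000 works for ~450k rows.
const BATCH_SIZE: usize = 10_000;
const NUM_ROWS: usize = 450_047; // row count from the report

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema copied from the report.
    let schema = Arc::new(parse_message_type(
        "message schema {
            REQUIRED INT32 a;
            REQUIRED INT32 b;
            REQUIRED INT32 c;
            REQUIRED INT64 d;
            REQUIRED INT32 e;
            REQUIRED BYTE_ARRAY f (UTF8);
            REQUIRED BOOLEAN g;
        }",
    )?);
    // The report mentions .set_dictionary_enabled(true); it was later set
    // to false for the fastparquet experiment.
    let props = Arc::new(
        WriterProperties::builder()
            .set_dictionary_enabled(true)
            .build(),
    );
    let file = File::create(Path::new("some.parquet"))?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    let mut written = 0;
    while written < NUM_ROWS {
        let n = BATCH_SIZE.min(NUM_ROWS - written);
        let mut row_group = writer.next_row_group()?;
        while let Some(mut col) = row_group.next_column()? {
            // All fields are REQUIRED, so no definition/repetition levels.
            match col {
                ColumnWriter::Int32ColumnWriter(ref mut w) => {
                    w.write_batch(&vec![42_i32; n], None, None)?;
                }
                ColumnWriter::Int64ColumnWriter(ref mut w) => {
                    w.write_batch(&vec![42_i64; n], None, None)?;
                }
                ColumnWriter::ByteArrayColumnWriter(ref mut w) => {
                    w.write_batch(&vec![ByteArray::from("ping"); n], None, None)?;
                }
                ColumnWriter::BoolColumnWriter(ref mut w) => {
                    w.write_batch(&vec![true; n], None, None)?;
                }
                _ => unreachable!("schema only has the physical types above"),
            }
            row_group.close_column(col)?;
        }
        writer.close_row_group(row_group)?;
        written += n;
    }
    writer.close()?;
    Ok(())
}
```

If batch sizes 1 and 10000 corrupt the file while 1000 does not, the writer's bookkeeping at row-group and page boundaries is a natural first suspect, which is why comparing against the real producing code matters.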
> [Rust][Python][Parquet] Parquet write_batch and read from Python fails with
> batch size 10000 or 1 but okay with 1000
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8677
>                 URL: https://issues.apache.org/jira/browse/ARROW-8677
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python, Rust
>    Affects Versions: 0.17.0
>         Environment: Linux debian
>            Reporter: Novice
>            Priority: Critical
>         Attachments: test.parquet.tgz
>
>
> I am using Rust to write a Parquet file and read it from Python.
> When writing with write_batch and a batch size of 10000, reading the Parquet file from Python gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
>     table = reader.read(**options)
>   File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
> Traceback (most recent call last):
>   [... same call stack as the first traceback ...]
> OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows.
> Schema:
> ```
> message schema {
>   REQUIRED INT32 a;
>   REQUIRED INT32 b;
>   REQUIRED INT32 c;
>   REQUIRED INT64 d;
>   REQUIRED INT32 e;
>   REQUIRED BYTE_ARRAY f (UTF8);
>   REQUIRED BOOLEAN g;
> }
> ```
>
> EDIT: as I add more rows (an estimated 80 million), batch size 1000 does not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
>   [... same call stack as the first traceback ...]
> OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Unless I am using it wrong (which doesn't seem to be the case, since the API is simple), this is not usable at all :(
>
> EDIT: some more logs, using a batch size of 1000 and a lot of rows:
> ```
> >>> df = pd.read_parquet("ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
>   [... same call stack as the first traceback ...]
> OSError: The file only has -959432807 columns, requested metadata for column: 6
> ```
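A negative column count like that suggests the footer metadata is being mis-read rather than the file merely being large. As a cross-check, again an editorial sketch rather than anything from the report, one can read the footer back with the same Rust crate's `SerializedFileReader` and print the counts pyarrow is tripping on (the file name is the reporter's):

```rust
// Editorial sketch (not from the report): read back the footer metadata of
// the file the Rust writer produced and print the counts that pyarrow
// reports as garbage.
use std::{fs::File, path::Path};

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(Path::new("data/ping_pong.parquet"))?;
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata();
    // For the quoted 7-field schema, a healthy file should print 7 columns;
    // a huge negative number on the pyarrow side points at a corrupt footer.
    println!("columns:    {}", meta.file_metadata().schema_descr().num_columns());
    println!("rows:       {}", meta.file_metadata().num_rows());
    println!("row groups: {}", meta.num_row_groups());
    Ok(())
}
```

If this Rust read-back succeeds while pyarrow fails on the same file, that divergence itself narrows the search to where the two implementations disagree about the footer.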
> EDIT:
> I wanted to try fastparquet, but it seems fastparquet does not support .set_dictionary_enabled(true), so I set it to false.
> It turns out fastparquet is fine, so this is likely a problem with pyarrow.
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1281, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1137, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 605, in read
>     table = reader.read(**options)
>   File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: The file only has -580697109 columns, requested metadata for column: 5
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="fastparquet")
> ```

--
This message was sent by Atlassian Jira
(v8.3.4#803005)