[ https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sarah Gilmore updated ARROW-14723:
----------------------------------
    Attachment: main.cpp

> [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14723
>                 URL: https://issues.apache.org/jira/browse/ARROW-14723
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>            Reporter: Sarah Gilmore
>            Priority: Minor
>         Attachments: intmax32.parq, intmax32plus1.parq, main.cpp
>
>
> It's possible to create Parquet files containing row groups whose lengths are
> greater than int32 max (2147483647). However, pyarrow cannot read these
> files.
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq");
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq");
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>
> However, both files can be imported via the C++ Arrow bindings without any
> issues.
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
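
To check whether a given file falls into the case described in this issue, one option is to inspect the row-group lengths recorded in the Parquet footer and compare them with int32 max. The following is only a rough sketch: the helper names oversized_row_groups and row_group_lengths are hypothetical and not part of pyarrow, and it assumes that the footer metadata of an affected file can still be opened through pq.ParquetFile even though pq.read_table fails, which the report does not confirm.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the limit quoted in the report


def oversized_row_groups(row_counts):
    """Return (index, num_rows) for every row group longer than int32 max.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    return [(i, n) for i, n in enumerate(row_counts) if n > INT32_MAX]


def row_group_lengths(path):
    """Read per-row-group row counts from the Parquet footer metadata.

    Assumption: opening the footer works for the affected file.
    pyarrow is imported lazily so the helper above runs without it.
    """
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    return [metadata.row_group(i).num_rows
            for i in range(metadata.num_row_groups)]


if __name__ == "__main__":
    import sys

    # Usage: python check_row_groups.py intmax32plus1.parq
    for path in sys.argv[1:]:
        for index, num_rows in oversized_row_groups(row_group_lengths(path)):
            print(f"{path}: row group {index} has {num_rows} rows (> int32 max)")
```

A row group of exactly 2147483647 rows is not flagged, matching the report that intmax32.parq reads without issues; only lengths strictly greater than int32 max are listed.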