[ https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446022#comment-17446022 ]

Sarah Gilmore commented on ARROW-14723:
---------------------------------------

Hi [~jorisvandenbossche],

Here's code you can use to generate both files: [^main.cpp]. When you run it, 
you'll be prompted in the terminal for the output filename and the number of 
rows you want the Parquet file to have. 
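
The gist of the program is roughly the sketch below (illustrative, not the 
attachment verbatim; the single boolean column "x" and the explicit 
max_row_group_length are just one way to end up with a single oversized row 
group):

{code:c++}
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

#include <iostream>
#include <memory>
#include <string>

// Write a single-column table of num_rows booleans as one Parquet row group.
// NOTE: a boolean column is just a compact placeholder; the attached main.cpp
// may differ. Booleans are bit-packed in Arrow, so 2^31 + 1 values need only
// about 256 MB of memory.
arrow::Status WriteFile(const std::string& path, int64_t num_rows) {
  arrow::BooleanBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(num_rows));
  for (int64_t i = 0; i < num_rows; ++i) {
    builder.UnsafeAppend(true);
  }
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("x", arrow::boolean())}), {array});

  ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(path));

  // Raise the row-group cap and pass num_rows as the chunk size so the
  // writer emits everything as a single row group of num_rows rows.
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .max_row_group_length(num_rows)
          ->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/num_rows, props);
}

int main() {
  std::string path;
  int64_t num_rows = 0;
  std::cout << "Output filename: ";
  std::cin >> path;
  std::cout << "Number of rows: ";
  std::cin >> num_rows;

  arrow::Status st = WriteFile(path, num_rows);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
{code}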

I noticed that if I link against the latest version of Arrow (I believe 
7.0.0), the files created by the program can be read with pyarrow. However, 
if I link against 4.0.1, Parquet files whose row groups exceed 2147483647 
rows cannot be read with pyarrow. I suppose this issue has been resolved in 
a later release of Arrow?

 

Best,

Sarah

 

> [Python] pyarrow cannot import parquet files containing row groups whose 
> lengths exceed int32 max. 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14723
>                 URL: https://issues.apache.org/jira/browse/ARROW-14723
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>            Reporter: Sarah Gilmore
>            Priority: Minor
>         Attachments: intmax32.parq, intmax32plus1.parq, main.cpp
>
>
> It's possible to create Parquet files containing row groups whose lengths are 
> greater than int32 max (2147483647). However, pyarrow cannot read these 
> files. 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq"); 
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq"); 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>  
> However, both files can be imported via the C++ Arrow bindings without any 
> issues.
>  
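> For reference, a minimal C++ reader along these lines handles both files (a 
> sketch for illustration, not the exact code used to verify):
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/file.h>
> #include <parquet/arrow/reader.h>
>
> #include <iostream>
> #include <memory>
> #include <string>
>
> // Open a Parquet file with the Arrow C++ bindings and read it in full.
> // Illustrative sketch; not the exact code used for verification.
> arrow::Status ReadFile(const std::string& path) {
>   ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
>
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
>       infile, arrow::default_memory_pool(), &reader));
>
>   std::shared_ptr<arrow::Table> table;
>   ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
>   std::cout << "num_rows: " << table->num_rows() << std::endl;
>   return arrow::Status::OK();
> }
>
> int main(int argc, char** argv) {
>   if (argc != 2) {
>     std::cerr << "usage: " << argv[0] << " <file.parquet>" << std::endl;
>     return 1;
>   }
>   arrow::Status st = ReadFile(argv[1]);
>   if (!st.ok()) {
>     std::cerr << st.ToString() << std::endl;
>     return 1;
>   }
>   return 0;
> }
> {code}
>  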



--
This message was sent by Atlassian Jira
(v8.20.1#820001)