[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278976#comment-17278976
 ] 

Pac A. He commented on ARROW-11456:
-----------------------------------

Unfortunately I have not been able to produce a reproducible result in a simple 
example despite multiple attempts and experiments. It reproduces only over my 
actual data. Nevertheless, obviously the exception traceback and the error 
message could still be indicative of what's causing it. Having  31-bit limit 
makes no sense to me.

> [Python] Parquet reader cannot read large strings
> -------------------------------------------------
>
>                 Key: ARROW-11456
>                 URL: https://issues.apache.org/jira/browse/ARROW-11456
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>            Reporter: Pac A. He
>            Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
>     df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
>     return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
>     return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to