[https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281854#comment-17281854]
Antoine Pitrou commented on ARROW-11456:
----------------------------------------
Thanks for the reproducer [~apacman]. Unfortunately, even 48 GB of RAM is not
enough to run it.
I tried to write another reproducer:
{code:python}
import pandas as pd
import pyarrow as pa

# A single ~2.5 GB string value, which overflows the 2147483646-byte cap
# of a 32-bit-offset string array (see the traceback below).
df = pd.Series(["x" * 2_500_000_000]).astype("string").to_frame("s")
out = pa.BufferOutputStream()
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False)
{code}
However, it fails a bit differently, so I'm not sure it's the same issue:
{code}
Traceback (most recent call last):
  File "../bug_11456.py", line 15, in <module>
    df.to_parquet(out, engine="pyarrow", compression="lz4", index=False)
[...]
  File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pandas/core/arrays/string_.py", line 250, in __arrow_array__
    return pa.array(values, type=type, from_pandas=True)
  File "pyarrow/array.pxi", line 301, in pyarrow.lib.array
    return _ndarray_to_array(values, mask, type, c_from_pandas, safe,
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
    check_status(NdarrayToArrow(pool, values, mask, from_pandas,
  File "pyarrow/error.pxi", line 109, in pyarrow.lib.check_status
    raise ArrowCapacityError(message)
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2500000000
{code}
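For the record, the capacity error above comes from the signed 32-bit offsets of the default string type, which cap a single array at 2147483646 bytes of character data. A minimal sketch (memory permitting) that sidesteps the cap by requesting 64-bit offsets up front:
{code:python}
import pyarrow as pa

# string() offsets are signed 32-bit, so one array holds at most
# 2147483646 bytes of character data; large_string() offsets are
# 64-bit and are not subject to that cap.
values = ["x" * 100] * 10  # small stand-in for the ~2.5 GB column
arr = pa.array(values, type=pa.large_string())
print(arr.type)  # large_string
{code}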
> [Python] Parquet reader cannot read large strings
> -------------------------------------------------
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
> Reporter: Pac A. He
> Priority: Major
>
> When reading or writing a large Parquet file, I get the following error (the traceback below is from a read):
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
>     return impl.read(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
>     return self.api.parquet.read_table(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large Parquet files? It let me write this
> file, but now it won't let me read it back. I don't understand why Arrow
> limits [array lengths to 31 bits|https://arrow.apache.org/docs/format/Columnar.html#array-lengths];
> it's not even a full 32 bits, given that sizes are non-negative.
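> One possible mitigation sketch (assuming the file was written with more than
> one row group) is to read it row group by row group, so that no single
> contiguous string array has to hold all the offsets at once:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Read one row group at a time; concat_tables keeps the column chunked,
> # so no single array exceeds the 2147483646-element cap.
> pf = pq.ParquetFile("mydata.parquet")
> table = pa.concat_tables(
>     [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> )
> {code}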
> This problem started after I added a string column with 2.5 billion unique
> rows; each value is effectively a unique base64-encoded string of 24
> characters. Note that 2,500,000,000 elements already exceed the
> 2,147,483,646-element cap in the error above. Below is code to reproduce
> the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
>     return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
>
> # 2.5 billion unique 24-character base64 strings
> df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
>
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
>     df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more
> than 2147483646 child elements, got 2500000000
> {noformat}
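> A possible workaround sketch for the write side (untested at this scale, and
> memory permitting) is to build the table with an explicit large_string
> schema, whose 64-bit offsets avoid the 2,147,483,646-element cap, and write
> it with pyarrow directly:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Force 64-bit offsets for column "s" instead of the default 32-bit string.
> schema = pa.schema([("s", pa.large_string())])
> # df is the dataframe from the reproducer above
> table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
> pq.write_table(table, "mydata.parquet", compression="gzip")
> {code}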
--
This message was sent by Atlassian Jira
(v8.3.4#803005)