[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading or writing a large Parquet file, I get this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large Parquet files? It let me write this Parquet file, but now it won't let me read it back. I don't understand why Arrow uses [31-bit array lengths|https://arrow.apache.org/docs/format/Columnar.html#array-lengths]; it's not even 32-bit, since sizes are non-negative.
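
For reference, the ~2^31 cap in the error message appears to come from Arrow's default string layout, which stores value offsets as signed 32-bit integers; the large_string type uses 64-bit offsets instead. A minimal sketch of the difference, assuming a recent pyarrow:

{code:python}
import pyarrow as pa

# The default string type stores offsets as int32, so a single array/chunk is
# capped at roughly 2**31 child elements; large_string uses int64 offsets.
print(pa.string())        # 32-bit offsets
print(pa.large_string())  # 64-bit offsets

# Data built as large_string is not subject to the 2**31 per-chunk ceiling.
arr = pa.array(["abc", "def"], type=pa.large_string())
print(arr.type)
{code}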

This problem started after I added a string column with 2.5 billion unique rows. Each value was effectively a unique base64-encoded string of length 24. Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The dataframe is created correctly. When attempting to write it as a Parquet file, the last line of the above code leads to this error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 25
{noformat}
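
A possible write-side workaround, sketched below: build the Arrow table explicitly and cast the string column to large_string (64-bit offsets) before writing with pyarrow directly, instead of going through pandas' to_parquet. This is only a sketch; the tiny stand-in frame, the column name "s", and the local output path mirror the repro above, and whether the cast avoids the capacity error for a 2.5-billion-row column is an assumption.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small stand-in frame; in the repro above, "s" holds ~2.5e9 unique strings.
df = pd.DataFrame({"s": ["abc", "def"]}).astype({"s": "string"})

# Convert to Arrow and cast the column to large_string (64-bit offsets).
table = pa.Table.from_pandas(df, preserve_index=False)
table = table.cast(pa.schema([pa.field("s", pa.large_string())]))

pq.write_table(table, "mydata.parquet", compression="gzip")
{code}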

  was:
When reading or writing a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The dataframe is created correctly. When attempting to write it as a parquet 
file, the above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Environment: 
pyarrow 3.0.0 / 2.0.0
pandas 1.1.5 / 1.2.1
smart_open 4.1.2
python 3.8.6

  was:
pyarrow 3.0.0 / 2.0.0
pandas 1.2.1
python 3.8.6


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
>     return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
>     df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}
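
A possible read-side workaround for the traceback above, sketched under the assumption that the file was written with multiple row groups, so that no single Arrow chunk has to materialize all 2.5 billion strings at once; the local path is a placeholder:

{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Read the file one row group at a time instead of a single read_all() call.
pf = pq.ParquetFile("mydata.parquet")
frames = [
    pf.read_row_group(i).to_pandas()  # each row group stays far below 2**31 values
    for i in range(pf.num_row_groups)
]
df = pd.concat(frames, ignore_index=True)
{code}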



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The dataframe is created correctly. When attempting to write it, the above code 
leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}

  was:
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The dataframe is created correctly. When attempting to write it as a parquet 
file, the above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}

  was:
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The dataframe is created correctly. When attempting to write it, the above code 
leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("strcol")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}

  was:
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("string1")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}

  was:
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("strcol")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()

df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("string1")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
{code}

The above code leads to the error:
{noformat}
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 25
{noformat}

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
{code}



> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 2.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string. 
Below is code to reproduce the issue:

{code:python}
from base64 import urlsafe_b64encode

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open

def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
{code}


  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string.


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 24 string.

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string.


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string.

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this file, 
but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11456:
---
Summary: [Python] Parquet reader cannot read large strings  (was: OSError: 
Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 
child elements)

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this file, 
> but now it doesn't let me read it back. I don't understand why arrow uses 
> [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)