[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-12 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283697#comment-17283697
 ] 

Joris Van den Bossche edited comment on ARROW-11456 at 2/12/21, 1:44 PM:
-

bq. Note that you may be able to do the conversion manually and force an Arrow 
large_string type, though I'm not sure Pandas allows that. 

Yes, pandas allows that by specifying a pyarrow schema manually (instead of 
letting pyarrow infer it from the dataframe).

For the example above, that would look like:

{code}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False,
              schema=pa.schema([("s", pa.large_string())]))
{code}
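
To make that concrete, here is a self-contained sketch of the same workaround; 
the tiny dataframe and the output path are just placeholders for the real data:

{code:python}
import pandas as pd
import pyarrow as pa

# Tiny stand-in for the real multi-billion-row dataframe.
df = pd.DataFrame({"s": ["abc", "def"]})

# Passing an explicit schema makes pyarrow use large_string (64-bit offsets)
# instead of the inferred string type (32-bit offsets) for column "s".
df.to_parquet(
    "out.parquet",
    engine="pyarrow",
    compression="lz4",
    index=False,
    schema=pa.schema([("s", pa.large_string())]),
)
{code}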


[~apacman] does that help as a work-around?


was (Author: jorisvandenbossche):
bq. Note that you may be able to do the conversion manually and force an Arrow 
large_string type, though I'm not sure Pandas allows that. 

Yes, pandas allows that by specifying a pyarrow schema manually (instead of 
letting pyarrow infer it from the dataframe).

For the example above, that would look like:

{code}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False, 
schema=pa.schema([("s", pa.large_string())]))
{code}

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-09 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281869#comment-17281869
 ] 

Pac A. He edited comment on ARROW-11456 at 2/9/21, 4:22 PM:


We have seen that there are one or more pyarrow limits at 2147483646 (about 
2^31) bytes and rows for a column. As a user I request that this limit be 
increased to somewhere closer to 2^64, so that downstream packages, e.g. 
pandas, work transparently. It is unreasonable to ask me to write partitioned 
Parquet files given that fastparquet has no trouble writing a single large 
Parquet file, so it's definitely technically feasible.


was (Author: apacman):
We have seen that there are one or more pyarrow limits at 2147483646 (about 
2**31) bytes and rows for a column. As a user I request that this limit be 
increased to somewhere closer to 2**64, so that downstream packages, e.g. 
pandas, work transparently. It is unreasonable to ask me to write partitioned 
Parquet files given that fastparquet has no trouble writing a single large 
Parquet file, so it's definitely technically feasible.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-09 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281869#comment-17281869
 ] 

Pac A. He edited comment on ARROW-11456 at 2/9/21, 4:22 PM:


We have seen that there are one or more pyarrow limits at 2147483646 (about 
2^31) bytes and rows for a column. As a user I request that this limit be 
increased to somewhere closer to 2^64, so that downstream packages, e.g. 
pandas, work transparently. It is unreasonable to ask me to write partitioned 
Parquet files given that fastparquet has no trouble writing a single large 
Parquet file, so it's definitely technically feasible.


was (Author: apacman):
We have seen that there are one or more pyarrow limits at 2147483646 (about 
2^31) bytes and rows for a column. As a user I request that this limit be 
increased to somewhere closer to 2^64, so that downstream packages, e.g. 
pandas, work transparently. It is unreasonable to ask me to write partitioned 
Parquet files given that fastparquet has no trouble writing a single large 
Parquet file, so it's definitely technically feasible.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279918#comment-17279918
 ] 

Pac A. He edited comment on ARROW-11456 at 2/5/21, 7:01 PM:


I see. I have now added code to reproduce the issue. Basically, when I attempt 
to write a Parquet file from a pandas dataframe having 2.5 billion unique 
string rows in a column, I get the error. Due to the large size of the 
dataframe, it will be memory- and time-intensive to test.


was (Author: apacman):
I see. I have now added code to reproduce the issue. Basically, when I attempt 
to write a parquet file from a pandas dataframe having 2.5 billion unique 
string rows in a column, I have the error.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string. 
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
> def num_to_b64(num: int) -> str:
> return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
> df = 
> pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
> df.to_parquet(output_file, engine="pyarrow", compression="gzip", 
> index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet 
> file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
> than 2147483646 child elements, got 25
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279136#comment-17279136
 ] 

Weston Pace edited comment on ARROW-11456 at 2/4/21, 8:26 PM:
--

The 31-bit limit you are referencing is not the one at play here and is not 
really relevant. There is another 31-bit limit that has to do with how Arrow 
stores strings. Parquet does not need to support random access to strings: the 
way it stores byte arrays and byte-array lengths does not support random 
access, so you could not fetch the i-th string of a Parquet-encoded UTF-8 byte 
array.

Arrow does need to support this use case. It stores strings using two arrays: 
an array of offsets and an array of bytes. To fetch the i-th string, Arrow 
looks up offsets[i] and offsets[i+1] to determine the range that needs to be 
fetched from the array of bytes.

There are two string types in Arrow, "string" and "large_string". The "string" 
data type uses 4-byte signed integer offsets while the "large_string" data 
type uses 8-byte signed integer offsets. So it is not possible to create a 
"string" array whose data contains more than about 2 billion bytes.
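
A small sketch of that layout may help; the sample strings below are 
placeholders, and the offset arithmetic is just the conceptual model described 
above:

{code:python}
import pyarrow as pa

values = ["foo", "barbaz", "x"]

# Conceptually, Arrow keeps one concatenated byte buffer plus an offsets array;
# the i-th string is data[offsets[i]:offsets[i+1]].
data = "".join(values).encode()
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v.encode()))
print(offsets)                                # [0, 3, 9, 10]
print(data[offsets[1]:offsets[2]].decode())   # barbaz

# "string" stores these offsets as int32, "large_string" as int64.
arr = pa.array(values)                            # type: string
big = pa.array(values, type=pa.large_string())    # type: large_string
print(arr.type, big.type)
{code}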

Now, this is not normally a problem. Arrow can fall back to a chunked array 
(which is why the 31-bit limit you reference isn't always such an issue).
{code:python}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152 // # of strings
>>> len(y) * 1024
2147483648 // # of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}
However, it does seem that, if there are 2 billion strings (as opposed to just 
2 billion bytes), the chunked-array fallback does not kick in.
{code:python}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 109, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more 
than 2147483646 child elements, got 2147483648
{code}
This "should" be representable using a chunked array with two chunks.  It is 
possible this is the source of your issue.  Or maybe when reading from parquet 
the "fallback to chunked array" logic simply doesn't apply.  I don't know the 
parquet code well enough.  That is one of the reasons it would be helpful to 
have a reproducible test.
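
If it helps, here is a hedged sketch of what such a two-chunk representation 
looks like when built by hand, with tiny placeholder chunks (a real fix would 
need the builder or reader to perform this split automatically):

{code:python}
import pyarrow as pa

# Two separately built string arrays, each safely below the 2**31 - 1 offset
# limit, combined into one logical column.
chunked = pa.chunked_array([
    pa.array(["0" * 8] * 4),
    pa.array(["0" * 8] * 4),
])
print(chunked.num_chunks, len(chunked))   # 2 8
{code}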

It also might be easier to just write your Parquet data out to multiple files 
or multiple row groups. Both of these approaches should not only avoid this 
issue but also reduce the memory pressure when you are converting to pandas.
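
For example (a hedged sketch; the placeholder table, file names, and the tiny 
row-group size are arbitrary and would be much larger in practice):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["abc", "def", "ghi", "jkl"]})  # placeholder data

# Option 1: smaller row groups in a single file; each row group comes back as
# its own chunk, which should keep every chunk below the 2**31 - 1 limit.
pq.write_table(table, "data.parquet", row_group_size=2)

# Option 2: split the data across multiple files.
pq.write_table(table.slice(0, 2), "part-0.parquet")
pq.write_table(table.slice(2), "part-1.parquet")
{code}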


was (Author: westonpace):
The 31 bit limit you are referencing is not the 31 bit limit that is at play 
here and not really relevant.  There is another 31 bit limit that has to do 
with how arrow stores strings.  Parquet does not need to support random access 
of strings.  The way it stores byte arrays & byte array lengths does not 
support random access.  You could not fetch the ith string of a parquet encoded 
utf8 byte array.

Arrow does need to support this use case.  It stores strings using two arrays.  
The first is an array of offsets.  The second is an array of bytes.  To fetch 
the ith string Arrow will look up offsets[i] and offsets[i+1] to determine the 
range that needs to be fetched from the array of bytes.

There are two string types in Arrow, "string" and "large_string".  The "string" 
data type uses 4 byte signed integer offsets while the "large_string" data type 
uses 8 byte signed integer offsets.  So it is not possible to create a "string" 
array with data containing more than 2 billion bytes.

Now, this is not normally a problem.  Arrow can fall back to a chunked array 
(which is why the 31 bit limit you reference isn't such an issue).
{code:java}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152 // # of strings
>>> len(y) * 1024
2147483648 // # of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}
However, it does seem that, if there are 2 billion strings (as opposed to just 
2 billion bytes), the chunked array fallback is not applying.
{code:java}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in 

[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278976#comment-17278976
 ] 

Pac A. He edited comment on ARROW-11456 at 2/4/21, 5:09 PM:


Unfortunately I have not been able to reproduce the issue in a simple example 
despite multiple attempts and experiments. I read a dataframe with 10 string 
columns and 2 billion rows without a problem; the issue reproduces only with 
my actual data. Nevertheless, the exception traceback and the error message 
could still be indicative of what's causing it. Having a 31-bit limit makes no 
sense to me.


was (Author: apacman):
Unfortunately I have not been able to produce a reproducible result in a simple 
example despite multiple attempts and experiments. I tried reading a dataframe 
with 10 string columns and 2 billion rows. The issue reproduces only over my 
actual data. Nevertheless, obviously the exception traceback and the error 
message could still be indicative of what's causing it. Having a 31-bit limit 
makes no sense to me.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278976#comment-17278976
 ] 

Pac A. He edited comment on ARROW-11456 at 2/4/21, 5:09 PM:


Unfortunately I have not been able to reproduce the issue in a simple example 
despite multiple attempts and experiments. I read a dataframe with 10 string 
columns and 2 billion rows without a problem; the issue reproduces only with 
my actual real-world data. Nevertheless, the exception traceback and the error 
message could still be indicative of what's causing it. Having a 31-bit limit 
makes no sense to me.


was (Author: apacman):
Unfortunately I have not been able to produce a reproducible result in a simple 
example despite multiple attempts and experiments. I read a dataframe with 10 
string columns and 2 billion rows without issue. The issue reproduces only over 
my actual data. Nevertheless, obviously the exception traceback and the error 
message could still be indicative of what's causing it. Having a 31-bit limit 
makes no sense to me.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278976#comment-17278976
 ] 

Pac A. He edited comment on ARROW-11456 at 2/4/21, 5:08 PM:


Unfortunately I have not been able to reproduce the issue in a simple example 
despite multiple attempts and experiments. I tried reading a dataframe with 10 
string columns and 2 billion rows; the issue reproduces only with my actual 
data. Nevertheless, the exception traceback and the error message could still 
be indicative of what's causing it. Having a 31-bit limit makes no sense to 
me.


was (Author: apacman):
Unfortunately I have not been able to produce a reproducible result in a simple 
example despite multiple attempts and experiments. It reproduces only over my 
actual data. Nevertheless, obviously the exception traceback and the error 
message could still be indicative of what's causing it. Having a 31-bit limit 
makes no sense to me.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 24 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234
 ] 

Pac A. He edited comment on ARROW-11456 at 2/2/21, 4:12 PM:


For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such 
files. That's a workaround for now, if only for Python, until this issue is 
resolved.
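
Concretely, the workaround is just a different engine argument (the path here 
is a placeholder):

{code:python}
import pandas as pd

# Read with fastparquet instead of pyarrow until this is fixed.
df = pd.read_parquet("mydata.parquet", engine="fastparquet")
{code}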


was (Author: apacman):
For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such 
files. That's a workaround for now until this issue is resolved.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 22 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276501#comment-17276501
 ] 

Pac A. He edited comment on ARROW-11456 at 2/1/21, 5:21 PM:


[~jorisvandenbossche] This is difficult in this case because the Parquet file 
is so large. What I can say is that this issue started *after I added a text 
string column with 1.5 billion unique rows. Each value was effectively a 
unique base64-encoded string of length 22*. I hope this helps. If you still 
need code, I can write a function to generate it.
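
For reference, a hedged sketch of such a generator; encoding 16-byte 
little-endian integers and stripping the base64 padding (both assumptions 
here) gives unique 22-character values:

{code:python}
from base64 import urlsafe_b64encode

import pandas as pd


def make_unique_strings(n: int) -> pd.Series:
    # Each 16-byte integer base64-encodes to 24 characters; stripping the
    # trailing "==" padding leaves a unique 22-character string.
    return pd.Series(
        [urlsafe_b64encode(i.to_bytes(16, "little")).decode().rstrip("=")
         for i in range(n)]
    )


# e.g. make_unique_strings(1_500_000_000) for the real case (memory permitting)
print(make_unique_strings(3).tolist())
{code}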


was (Author: apacman):
[~jorisvandenbossche] This is very difficult in this case because the parquet 
is so large. What I can say is that this issue started *after I added a text 
string column with 1.3 billion unique rows. Each value was effectively a unique 
base64 encoded length 22 string*. I hope this helps. If you still need code, I 
can write a function to generate it.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276501#comment-17276501
 ] 

Pac A. He edited comment on ARROW-11456 at 2/1/21, 5:21 PM:


[~jorisvandenbossche] This is difficult in this case because the Parquet file 
is so large. What I can say is that this issue started *after I added a string 
column with 1.5 billion unique rows. Each value was effectively a unique 
base64-encoded string of length 22*. I hope this helps. If you still need 
code, I can write a function to generate it.


was (Author: apacman):
[~jorisvandenbossche] This is difficult in this case because the parquet is so 
large. What I can say is that this issue started *after I added a text string 
column with 1.5 billion unique rows. Each value was effectively a unique base64 
encoded length 22 string*. I hope this helps. If you still need code, I can 
write a function to generate it.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276501#comment-17276501
 ] 

Pac A. He edited comment on ARROW-11456 at 2/1/21, 5:20 PM:


[~jorisvandenbossche] This is very difficult in this case because the Parquet 
file is so large. What I can say is that this issue started *after I added a 
text string column with 1.3 billion unique rows. Each value was effectively a 
unique base64-encoded string of length 22*. I hope this helps. If you still 
need code, I can write a function to generate it.


was (Author: apacman):
[~jorisvandenbossche] This is very difficult in this case because the parquet 
is so large. What I can say is that this issue started *after I added a text 
string column with 1.3 billion unique rows. Each value was effectively a unique 
base64 encoded length 22 string*. I hope this helps.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)