[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-12 Thread Joris Van den Bossche (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283697#comment-17283697 ]

Joris Van den Bossche commented on ARROW-11456:
---

bq. Note that you may be able to do the conversion manually and force an Arrow large_string type, though I'm not sure Pandas allows that.

Yes, pandas allows that, by specifying a pyarrow schema manually (instead of letting pyarrow infer the schema from the dataframe).

For the example above, that would look like:

{code:python}
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False,
              schema=pa.schema([("s", pa.large_string())]))
{code}
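For reference, a self-contained version of that snippet (a minimal sketch; the tiny dataframe and the {{out}} buffer are illustrative stand-ins borrowed from the reproducer, not part of the real workload):

{code:python}
import pandas as pd
import pyarrow as pa

# Stand-in dataframe; the real case has billions of unique strings.
df = pd.Series(["abc", "def"]).astype("string").to_frame("s")

out = pa.BufferOutputStream()
# An explicit schema forces Arrow's large_string type (64-bit offsets)
# instead of the inferred string type (32-bit offsets).
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False,
              schema=pa.schema([("s", pa.large_string())]))
{code}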

> [Python] Parquet reader cannot read large strings
> --------------------------------------------------
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
> Reporter: Pac A. He
> Priority: Major
>
> When reading or writing a large parquet file, I get this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
>     return impl.read(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
>     return self.api.parquet.read_table(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquet files? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths]. It's not even 32-bit, given that sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique rows. Each value was effectively a unique base64-encoded string of length 24. Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
>
>
> def num_to_b64(num: int) -> str:
>     return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
>
>
> df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
>
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
>     df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 25
> {noformat}





[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-09 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281909#comment-17281909 ]

Antoine Pitrou commented on ARROW-11456:


Yeah, well, the first question is at which layer the error occurs. According to my reproducer, it may be during the Pandas->Arrow conversion. But your reproducer is different...

Note that you may be able to do the conversion manually and force an Arrow {{large_string}} type, though I'm not sure Pandas allows that. I'll let [~jorisvandenbossche] comment on this.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-09 Thread Pac A. He (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281869#comment-17281869 ]

Pac A. He commented on ARROW-11456:
---

We have seen that there are one or more pyarrow limits at 2147483646 (about 2**31) bytes and rows for a column. As a user, I request that this limit be raised to somewhere closer to 2**64, so that downstream packages, e.g. pandas, work transparently. It is unreasonable to ask me to write partitioned parquet files, given that fastparquet has no trouble writing a large parquet file, so it's clearly technically feasible.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-09 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281854#comment-17281854 ]

Antoine Pitrou commented on ARROW-11456:


Thanks for the reproducer [~apacman]. Unfortunately, even 48 GB of RAM is not enough to run it.

I tried to write another reproducer:
{code:python}
import numpy as np
import pandas as pd

import pyarrow as pa

df = pd.Series(["x" * 2_500_000_000]).astype("string").to_frame("s")

out = pa.BufferOutputStream()
df.to_parquet(out, engine="pyarrow", compression="lz4", index=False)
{code}

However, it fails a bit differently, so I'm not sure it's the same issue:
{code}
Traceback (most recent call last):
  File "../bug_11456.py", line 15, in <module>
    df.to_parquet(out, engine="pyarrow", compression="lz4", index=False)
[...]
  File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pandas/core/arrays/string_.py", line 250, in __arrow_array__
    return pa.array(values, type=type, from_pandas=True)
  File "pyarrow/array.pxi", line 301, in pyarrow.lib.array
    return _ndarray_to_array(values, mask, type, c_from_pandas, safe,
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
    check_status(NdarrayToArrow(pool, values, mask, from_pandas,
  File "pyarrow/error.pxi", line 109, in pyarrow.lib.check_status
    raise ArrowCapacityError(message)
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 25
{code}




[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-05 Thread Pac A. He (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279918#comment-17279918 ]

Pac A. He commented on ARROW-11456:
---

I see. I have now added code to reproduce the issue. Basically, when I attempt to write a parquet file from a pandas dataframe with a column of 2.5 billion unique string rows, I get the error.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279136#comment-17279136 ]

Weston Pace commented on ARROW-11456:
-

The 31-bit limit you are referencing is not the 31-bit limit at play here, and it is not really relevant. There is another 31-bit limit that has to do with how Arrow stores strings. Parquet does not need to support random access to strings: the way it stores byte arrays and byte-array lengths does not allow it, so you could not fetch the i-th string of a Parquet-encoded utf8 byte array.

Arrow does need to support this use case. It stores strings using two arrays: the first is an array of offsets, the second an array of bytes. To fetch the i-th string, Arrow looks up offsets[i] and offsets[i+1] to determine the range that needs to be fetched from the array of bytes, as the sketch below shows.
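A minimal sketch of that layout (illustrative only; it relies on pyarrow's buffer order of validity bitmap, offsets, then data):

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array(["ab", "cde", "f"])
# Buffers are: validity bitmap (None when there are no nulls), offsets, data.
_, offsets_buf, data_buf = arr.buffers()

offsets = np.frombuffer(offsets_buf, dtype=np.int32)  # [0 2 5 6]
i = 1
# The i-th string is data[offsets[i]:offsets[i+1]].
print(data_buf.to_pybytes()[offsets[i]:offsets[i + 1]].decode())  # cde
{code}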

There are two string types in Arrow, "string" and "large_string". The "string" data type uses 4-byte signed integer offsets, while the "large_string" data type uses 8-byte signed integer offsets. So it is not possible to create a "string" array whose data contains more than 2 billion bytes; the difference is illustrated below.
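A quick illustration of the offset widths (a sketch; the cast is just for demonstration):

{code:python}
import numpy as np
import pyarrow as pa

small = pa.array(["ab", "cde"], type=pa.string())
large = small.cast(pa.large_string())

# string offsets are int32; large_string offsets are int64.
print(np.frombuffer(small.buffers()[1], dtype=np.int32))  # [0 2 5]
print(np.frombuffer(large.buffers()[1], dtype=np.int64))  # [0 2 5]
{code}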

Now, this is not normally a problem. Arrow can fall back to a chunked array (which is why the 31-bit limit you reference isn't usually an issue):
{code:python}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152  # number of strings
>>> len(y) * 1024
2147483648  # number of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}
However, it does seem that, if there are 2 billion strings (as opposed to just 2 billion bytes), the chunked-array fallback does not apply:
{code:python}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 109, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
{code}
This "should" be representable using a chunked array with two chunks, so it is possible this is the source of your issue. Or maybe the "fall back to chunked array" logic simply doesn't apply when reading from Parquet; I don't know the Parquet code well enough. That is one of the reasons it would be helpful to have a reproducible test.

It also might be easier to just write your parquet out to multiple files or multiple row groups (see the sketch after this paragraph). Both approaches should not only avoid this issue but also reduce the memory pressure when you are converting to pandas.
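For the row-group option, something along these lines should work (a sketch; the table and the row-group size are made up for illustration):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["a", "b", "c"]})  # stand-in for the 2.5-billion-row table

# Smaller row groups keep each column chunk well below the 2**31-byte limit.
pq.write_table(table, "mydata.parquet", row_group_size=50_000_000)
{code}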





[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-04 Thread Pac A. He (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278976#comment-17278976 ]

Pac A. He commented on ARROW-11456:
---

Unfortunately, I have not been able to reproduce the issue in a simple example, despite multiple attempts and experiments; it reproduces only on my actual data. Nevertheless, the exception traceback and the error message could still be indicative of what's causing it. Having a 31-bit limit makes no sense to me.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Joris Van den Bossche (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277239#comment-17277239 ]

Joris Van den Bossche commented on ARROW-11456:
---

bq. If you still need code, I can write a function to generate it.

That would help, yes.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277234#comment-17277234 ]

Pac A. He commented on ARROW-11456:
---

For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files; that's a workaround for now until this issue is resolved. For example:
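(A minimal sketch, assuming fastparquet is installed; the file name is a placeholder.)

{code:python}
import pandas as pd

# Read with the fastparquet engine instead of pyarrow as a temporary workaround.
df = pd.read_parquet("mydata.parquet", engine="fastparquet")
{code}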



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276517#comment-17276517 ]

Antoine Pitrou commented on ARROW-11456:


Was the Parquet file generated with Arrow?



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Pac A. He (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276501#comment-17276501 ]

Pac A. He commented on ARROW-11456:
---

[~jorisvandenbossche] This is very difficult in this case because the parquet file is so large. What I can say is that this issue started *after I added a text string column with 1.3 billion unique rows. Each value was effectively a unique base64-encoded string of length 22*. I hope this helps.



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Joris Van den Bossche (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276480#comment-17276480 ]

Joris Van den Bossche commented on ARROW-11456:
---

[~apacman] would you be able to provide a reproducible example? (e.g. some code that writes the parquet file)



[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-01 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276462#comment-17276462 ]

Antoine Pitrou commented on ARROW-11456:


cc [~jorisvandenbossche]
