[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283697#comment-17283697 ]

Joris Van den Bossche edited comment on ARROW-11456 at 2/12/21, 1:44 PM:
-

bq. Note that you may be able to do the conversion manually and force an Arrow large_string type, though I'm not sure Pandas allows that.

Yes, pandas allows that by specifying a pyarrow schema manually (instead of letting pyarrow infer it from the dataframe). For the example above, that would look like:

{code}
df.to_parquet(
    out,
    engine="pyarrow",
    compression="lz4",
    index=False,
    schema=pa.schema([("s", pa.large_string())]),
)
{code}

[~apacman] does that help as a work-around?

was (Author: jorisvandenbossche):

bq. Note that you may be able to do the conversion manually and force an Arrow large_string type, though I'm not sure Pandas allows that.

Yes, pandas allows that by specifying a pyarrow schema manually (instead of letting pyarrow infer it from the dataframe). For the example above, that would look like:

{code}
df.to_parquet(
    out,
    engine="pyarrow",
    compression="lz4",
    index=False,
    schema=pa.schema([("s", pa.large_string())]),
)
{code}

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.1.5 / 1.2.1
> smart_open 4.1.2
> python 3.8.6
> Reporter: Pac A. He
> Priority: Major
>
> When reading or writing a large parquet file, I have this error:
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
>     return impl.read(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
>     return self.api.parquet.read_table(
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 2.5 billion unique rows. Each value was effectively a unique base64 encoded length 24 string.
> Below is code to reproduce the issue:
> {code:python}
> from base64 import urlsafe_b64encode
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import smart_open
>
> def num_to_b64(num: int) -> str:
>     return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
>
> df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
>
> with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
>     df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
> {code}
> The dataframe is created correctly. When attempting to write it as a parquet file, the last line of the above code leads to the error:
> {noformat}
> pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 25
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281869#comment-17281869 ]

Pac A. He edited comment on ARROW-11456 at 2/9/21, 4:22 PM:

We have seen that there are one or more pyarrow limits at 2147483646 (about 2^31) bytes and rows for a column. As a user I request this limit be increased to somewhere closer to 2^64, so that the downstream packages, e.g. pandas, etc., work transparently. It is unreasonable to ask me to write partitioned parquets given that fastparquet has no trouble writing a large parquet, so it's definitely technically feasible.

was (Author: apacman):

We have seen that there are one or more pyarrow limits at 2147483646 (about 2**31) bytes and rows for a column. As a user I request this limit be increased to somewhere closer to 2**64, so that the downstream packages, e.g. pandas, etc., work transparently. It is unreasonable to ask me to write partitioned parquets given that fastparquet has no trouble writing a large parquet, so it's definitely technically feasible.
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279918#comment-17279918 ]

Pac A. He edited comment on ARROW-11456 at 2/5/21, 7:01 PM:

I see. I have now added code to reproduce the issue. Basically, when I attempt to write a parquet file from a pandas dataframe having 2.5 billion unique string rows in a column, I get the error. Due to the large size of the dataframe, it will be memory and time intensive to test.

was (Author: apacman):

I see. I have now added code to reproduce the issue. Basically, when I attempt to write a parquet file from a pandas dataframe having 2.5 billion unique string rows in a column, I get the error.
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279136#comment-17279136 ]

Weston Pace edited comment on ARROW-11456 at 2/4/21, 8:26 PM:
--

The 31-bit limit you are referencing is not the 31-bit limit that is at play here, and not really relevant. There is another 31-bit limit that has to do with how Arrow stores strings.

Parquet does not need to support random access of strings. The way it stores byte arrays & byte array lengths does not support random access. You could not fetch the ith string of a parquet-encoded utf8 byte array.

Arrow does need to support this use case. It stores strings using two arrays. The first is an array of offsets. The second is an array of bytes. To fetch the ith string, Arrow will look up offsets[i] and offsets[i+1] to determine the range that needs to be fetched from the array of bytes.

There are two string types in Arrow, "string" and "large_string". The "string" data type uses 4-byte signed integer offsets while the "large_string" data type uses 8-byte signed integer offsets. So it is not possible to create a "string" array with data containing more than 2 billion bytes.

Now, this is not normally a problem. Arrow can fall back to a chunked array (which is why the 31-bit limit you reference isn't always such an issue).

{code:java}
>>> import pyarrow as pa
>>> x = '0' * 1024
>>> y = [x] * (1024 * 1024 * 2)
>>> len(y)
2097152  // # of strings
>>> len(y) * 1024
2147483648  // # of bytes
>>> a = pa.array(y)
>>> len(a.chunks)
2
>>> len(a.chunks[0])
2097151
>>> len(a.chunks[1])
1
{code}

However, it does seem that, if there are 2 billion strings (as opposed to just 2 billion bytes), the chunked array fallback is not applying.

{code:java}
>>> x = '0' * 8
>>> y = [x] * (1024 * 1024 * 1024 * 2)
>>> len(y)
2147483648
>>> len(y) * 8
17179869184
>>> a = pa.array(y)
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow\array.pxi", line 296, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 109, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
{code}

This "should" be representable using a chunked array with two chunks. It is possible this is the source of your issue. Or maybe when reading from parquet the "fallback to chunked array" logic simply doesn't apply. I don't know the parquet code well enough. That is one of the reasons it would be helpful to have a reproducible test.

It also might be easier to just write your parquet out to multiple files or multiple row groups. Both of these approaches should not only avoid this issue but also reduce the memory pressure when you are converting to pandas.
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278976#comment-17278976 ]

Pac A. He edited comment on ARROW-11456 at 2/4/21, 5:09 PM:

Unfortunately I have not been able to produce a reproducible result in a simple example despite multiple attempts and experiments. I read a dataframe with 10 string columns and 2 billion rows without issue. The issue reproduces only over my actual data. Nevertheless, the exception traceback and the error message could still be indicative of what's causing it. Having a 31-bit limit makes no sense to me.

was (Author: apacman):

Unfortunately I have not been able to produce a reproducible result in a simple example despite multiple attempts and experiments. I tried reading a dataframe with 10 string columns and 2 billion rows. The issue reproduces only over my actual data. Nevertheless, the exception traceback and the error message could still be indicative of what's causing it. Having a 31-bit limit makes no sense to me.
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234 ]

Pac A. He edited comment on ARROW-11456 at 2/2/21, 4:12 PM:

For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files. That's a workaround for now, if only for Python, until this issue is resolved.

was (Author: apacman):

For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files. That's a workaround for now until this issue is resolved.
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276501#comment-17276501 ] Pac A. He edited comment on ARROW-11456 at 2/1/21, 5:21 PM: [~jorisvandenbossche] Sharing a reproducer is difficult in this case because the parquet file is so large. What I can say is that this issue started *after I added a text string column with 1.5 billion unique rows. Each value was effectively a unique base64-encoded string of length 22*. I hope this helps. If you still need code, I can write a function to generate it. was (Author: apacman): [~jorisvandenbossche] This is very difficult in this case because the parquet is so large. What I can say is that this issue started *after I added a text string column with 1.3 billion unique rows. Each value was effectively a unique base64 encoded length 22 string*. I hope this helps. If you still need code, I can write a function to generate it.
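The comment above offers to write a function that generates such data. A minimal sketch of what that could look like (the name {{gen_unique_b64}} and the choice of 16 random bytes per value are assumptions, picked because 16 bytes base64-encode to 24 characters, or 22 once the {{==}} padding is stripped, matching the values described in the report):

```python
import base64
import secrets

def gen_unique_b64(n):
    """Return n unique 22-character base64 strings.

    16 random bytes encode to 24 base64 characters ending in '==';
    stripping the padding leaves 22-character values. At ~1.5 billion
    rows the column's total character data exceeds 2**31 bytes, which
    is what the 32-bit string reader cannot index.
    """
    seen = set()
    while len(seen) < n:
        value = base64.b64encode(secrets.token_bytes(16)).decode("ascii")
        seen.add(value.rstrip("="))
    return sorted(seen)
```

Note that materializing the full 1.5-billion-row column would need tens of gigabytes of memory, so any reproduction attempt should first be validated at a much smaller scale.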