westonpace commented on issue #32439:
URL: https://github.com/apache/arrow/issues/32439#issuecomment-1588196555
The issue is going to happen any time a single string column ends up with
more than 2^31 characters. In OP's reproduction, the column `square` has 161
characters per string and 800,000 * 24 strings, which is `3,091,200,000`
characters. 2^31 is `2,147,483,648`. At this point we have to split the
resulting array into chunks (or use the large_string data type, but that has
issues of its own).
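The arithmetic above can be checked quickly. This is only a back-of-the-envelope sketch: the variable names are mine, and it assumes every string in the column is exactly 161 characters.

```python
# Rough size check for the `square` column from the original report.
chars_per_string = 161          # characters per string, as stated above
num_strings = 800_000 * 24      # number of strings in the column
total_chars = chars_per_string * num_strings

limit = 2**31                   # the threshold discussed above

print(f"total characters: {total_chars:,}")
print(f"2^31:             {limit:,}")
print("over the limit:", total_chars > limit)
```

So the column is roughly 1.4 times larger than what fits before a split is needed.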
This "breaking unexpectedly large columns into chunks" behavior is rather
tricky and it appears we are doing something wrong when working with lists of
struct arrays. Here's a compact reproducer (that only has 3 rows):
```
import pyarrow as pa
import pandas as pd

# Three copies of a 1,000,000,000-character string: 3e9 characters in
# total, which is more than 2^31. Note: this needs several GB of memory.
x = "0" * 1000000000

# Case 1: plain string column
df = pd.DataFrame({"strings": [x, x, x]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

# Case 2: struct column with a string field
struct = {"struct_field": x}
df = pd.DataFrame({"structs": [struct, struct, struct]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

# Case 3: list-of-strings column
lists = [x]
df = pd.DataFrame({"lists": [lists, lists, lists]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

# Case 4: list-of-structs column
los = [struct]
df = pd.DataFrame({"los": [los, los, los]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)
```
It seems the struct array has length 3. Meanwhile, its child, the string
array, has length 2 (because it had to be broken into 2 chunks: the first
chunk has the first 2 values and the second chunk has the third).
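To illustrate why the split lands as 2 values plus 1 value, here is a pure-Python sketch of a greedy size-based split. This is not Arrow's actual conversion code: `split_by_size` is a hypothetical helper, and the `2**31 - 1` limit is my assumption about the per-chunk capacity.

```python
def split_by_size(lengths, limit=2**31 - 1):
    """Greedily group row indices so each group's total size stays <= limit.

    Hypothetical helper for illustration only; `limit` is an assumed
    per-chunk capacity, not a value read from Arrow.
    """
    chunks, current, current_size = [], [], 0
    for i, n in enumerate(lengths):
        # Start a new chunk when adding this row would exceed the limit.
        if current and current_size + n > limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += n
    if current:
        chunks.append(current)
    return chunks


# Three rows of 1,000,000,000 characters each, as in the reproducer.
chunks = split_by_size([1_000_000_000] * 3)
print(chunks)                    # [[0, 1], [2]]
print([len(c) for c in chunks])  # [2, 1]
```

Under these assumptions the child ends up with a first chunk of length 2, while a parent that was not split the same way would still report length 3.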
So if someone wanted to investigate this, I would recommend starting by
looking at the conversion-from-pandas code to see how the struct and
list arrays handle the case where their children are converted into
multiple chunks.