[ https://issues.apache.org/jira/browse/ARROW-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568984#comment-17568984 ]

Joris Van den Bossche commented on ARROW-17137:
-----------------------------------------------

[~SimonCW] thanks for the report! I can confirm the error on the latest master branch as well (on Linux).

> [Python] Converting data frame to Table with large nested column fails
> `Invalid Struct child array has length smaller than expected`
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17137
>                 URL: https://issues.apache.org/jira/browse/ARROW-17137
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Simon Weiß
>            Priority: Major
>              Labels: python-conversion
>
> Hey,
> I have a data frame for which one column is a nested struct array. Converting
> it to a pyarrow.Table fails if the data frame gets too big. I could reproduce
> the bug with a minimal example with anonymized data that is roughly similar
> to mine. When I set, e.g., N_ROWS=500_000, or smaller, it is working fine.
>
> {code:java}
> import pandas as pd
> import pyarrow as pa
> N_ROWS = 800_000
> item_record = {
>     "someImportantAssets": [
>         {
>             "square": "https://some.super.loooooooooong.link.com/withmany/lorem/upload/"
>             "ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
>             "/specificLink-i15348891"
>         }
>     ],
>     "id": "i15348891",
>     "title": "Some Long Item Title i15348891",
> }
> user_record = {
>     "userId": "faa4648-4964drf-64648fafa648-4648falj",
>     "data": [item_record for _ in range(24)],
> }
> df = pd.DataFrame([user_record for _ in range(N_ROWS)])
> table = pa.Table.from_pandas(df)
> {code}
>
> {code:java}
> Traceback (most recent call last):
>   File "/.../scratch/experiment_pq.py", line 23, in <module>
>     table = pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
>   File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array
> invalid: Invalid: Struct child array #1 invalid: Invalid: List child array
> invalid: Invalid: Struct child array #0 has length smaller than expected for
> struct array (13338407 < 13338408)
> {code}
> The length is always smaller than expected by 1.
>
> h2. Expected behavior:
> Run without errors or fail with a better error message.
>
> h2. System Info and Versions:
> Apple M1 Pro but also happened on amd64 Linux machine on AWS
>
> {code:java}
> arrow-cpp 7.0.0 py39h8a997f0_8_cpu conda-forge
> pyarrow 7.0.0 py39h3a11367_8_cpu conda-forge
> python 3.9.7 h54d631c_3_cpython conda-forge
> {code}
> I could also reproduce with
> {noformat}
> pyarrow 8.0.0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
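
A possible reading of the numbers in the report, offered as an assumption and not something confirmed in this thread: Arrow's default string type stores byte offsets as 32-bit integers, so a single string array can hold at most 2**31 - 1 bytes of character data. The sketch below only does arithmetic on the reproducer's "square" URL to estimate how many copies fit under that limit; the helper names (`max_strings_in_one_chunk`, `total_strings`) are hypothetical and nothing here calls pyarrow. The printed limit can be compared against the 13338407 in the error message.

```python
# Back-of-the-envelope check. The int32-offset explanation is an assumption,
# not a diagnosis confirmed in the report above.
INT32_MAX = 2**31 - 1  # largest byte offset representable with 32-bit offsets

# The "square" URL copied from the reproducer.
square = (
    "https://some.super.loooooooooong.link.com/withmany/lorem/upload/"
    "ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
    "/specificLink-i15348891"
)
ITEMS_PER_ROW = 24  # len(user_record["data"]) in the reproducer


def max_strings_in_one_chunk(value: str) -> int:
    """Number of copies of `value` whose total UTF-8 size stays within int32."""
    return INT32_MAX // len(value.encode("utf-8"))


def total_strings(n_rows: int) -> int:
    """One "square" string per item, ITEMS_PER_ROW items per data frame row."""
    return n_rows * ITEMS_PER_ROW


limit = max_strings_in_one_chunk(square)
print("bytes per string:", len(square.encode("utf-8")))
print("strings that fit in one chunk:", limit)

for n_rows in (500_000, 800_000):
    n = total_strings(n_rows)
    verdict = "fits in one chunk" if n <= limit else "exceeds one chunk"
    print(f"N_ROWS={n_rows}: {n} strings, {verdict}")
```

If the hypothesis holds, the smaller N_ROWS values work because the whole "square" column fits in one chunk, while the larger value forces a chunk boundary in the innermost string array that the enclosing struct and list arrays do not follow.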