[ https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yibo Cai updated ARROW-17983:
-----------------------------
    Summary: [Parquet][C++][Python] "List index overflow" when read parquet file  (was: [Parquet][C++][Python] "List Index overflow" when read parquet file)

> [Parquet][C++][Python] "List index overflow" when read parquet file
> -------------------------------------------------------------------
>
>                 Key: ARROW-17983
>                 URL: https://issues.apache.org/jira/browse/ARROW-17983
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>            Reporter: Yibo Cai
>            Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on how to reproduce this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Testing with a small dataset suggests the error might come from the code below.
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is incremented) {{m * n}} times, which exceeds {{max(int32)}}.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
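
The overflow condition described in the report can be illustrated with a small arithmetic sketch. This is not the Arrow code: the function names, the example values of n and m, and the reduced offset_max used in the usage lines are all hypothetical and chosen only to show why a 32-bit offset that is incremented once per leaf value cannot cover m * n values even though n itself fits in int32.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def total_child_values(n_rows: int, list_len: int) -> int:
    """Number of leaf values in a column of n_rows lists with list_len items each."""
    return n_rows * list_len


def overflows_int32_offsets(n_rows: int, list_len: int) -> bool:
    """True when the running list offset would need to exceed INT32_MAX."""
    return total_child_values(n_rows, list_len) > INT32_MAX


def accumulate_offset(n_values: int, offset_max: int = INT32_MAX) -> int:
    """Increment an offset once per value; raise if it would pass offset_max.

    Hypothetical stand-in for a per-value incrementing loop, for illustration only.
    """
    offset = 0
    for _ in range(n_values):
        if offset == offset_max:
            raise OverflowError("List index overflow")
        offset += 1
    return offset


# Hypothetical sizes: n fits in int32, but m * n does not.
n, m = 1_000_000, 3_000
print(n < INT32_MAX, overflows_int32_offsets(n, m))

# A tiny offset_max keeps the simulated loop fast.
print(accumulate_offset(10, offset_max=10))
```

With the full INT32_MAX limit the same check applies unchanged; the small offset_max is only there so that the loop finishes quickly.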