[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-17983:
-----------------------------
    Summary: [Parquet][C++][Python] "List index overflow" when read parquet 
file  (was: [Parquet][C++][Python] "List Index overflow" when read parquet file)

> [Parquet][C++][Python] "List index overflow" when read parquet file
> -------------------------------------------------------------------
>
>                 Key: ARROW-17983
>                 URL: https://issues.apache.org/jira/browse/ARROW-17983
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>            Reporter: Yibo Cai
>            Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on how to reproduce this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Based on testing with a small dataset, the error may come from the code below:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is
> incremented) {{m * n}} times, which exceeds {{max(int32)}}.
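
The arithmetic in the description above can be sketched in a few lines of Python. This is only an illustration of why a 32-bit offset runs out, not the Arrow reader code: the names total_elements and accumulate_offsets are hypothetical, the sketch advances the offset one row at a time rather than one element at a time, and it raises OverflowError where the report shows an OSError.

```python
from typing import List

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def total_elements(n_rows: int, list_len: int) -> int:
    """Number of child values in a column of n_rows lists, each of list_len items."""
    return n_rows * list_len


def accumulate_offsets(n_rows: int, list_len: int) -> List[int]:
    """Build list offsets row by row, failing once an offset no longer fits in int32.

    Hypothetical stand-in for a reader that tracks list offsets in a 32-bit integer.
    """
    offsets = [0]
    for _ in range(n_rows):
        next_offset = offsets[-1] + list_len
        if next_offset > INT32_MAX:
            # The row count is tiny here; only the running offset is too large.
            raise OverflowError("List index overflow")
        offsets.append(next_offset)
    return offsets
```

For example, accumulate_offsets(3, 4) returns [0, 4, 8, 12], while accumulate_offsets(3, 2**30) raises on the second row: n = 3 is far below max(int32), but m * n is not.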



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
