[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

2022-10-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619234#comment-17619234
 ] 

Micah Kornfield commented on ARROW-17983:
-

IIRC, the offset type here is inferred from the schema that we are trying to 
read back into (i.e. List vs LargeList). Once the offsets reach int32 max we 
can't return the data, since the read path doesn't support chunking at the moment.
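
To make the limit concrete, here is a minimal sketch in plain Python (not the Arrow reader itself) of how list offsets accumulate and why a 32-bit offset type has to give up once the running total passes int32 max. The function name {{build_offsets}} and the {{max_offset}} parameter are hypothetical, introduced only so the behaviour can be tried with small numbers; the real reader surfaces the failure as an OSError, whereas this sketch raises OverflowError.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit List offset can store


def build_offsets(lengths, max_offset=INT32_MAX):
    """Accumulate per-row list lengths into offsets, failing past the limit.

    Hypothetical helper: it mimics the shape of the problem, not Arrow's code.
    """
    offsets = [0]
    for length in lengths:
        next_offset = offsets[-1] + length
        if next_offset > max_offset:
            # Without chunking there is nowhere to put the remaining rows.
            raise OverflowError("List index overflow")
        offsets.append(next_offset)
    return offsets


# Three rows holding 2, 3 and 1 values need four offsets.
print(build_offsets([2, 3, 1]))
```

Passing a small {{max_offset}} (for example 10) reproduces the failure mode without allocating billions of values.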

There seem to be three options to fix this:
1. Infer that LargeList should be used based on RowGroup/File statistics.
2. Allow overriding the schema (this might already be an option) so that the 
reader takes a LargeList override.
3. Modify the code to allow for chunking arrays (I seem to recall this would be 
a fair amount of work based on current assumptions, but it's been a while since 
I dug into the code).
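
As a rough illustration of options 1 and 3, the sketch below uses plain Python rather than the Parquet reader. It assumes a total value count is available up front (for instance from row group or file metadata) and that per-row list lengths are known; both helper names, {{pick_list_type}} and {{plan_chunks}}, are hypothetical and not part of any Arrow API.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit List offset can store


def pick_list_type(total_values, max_offset=INT32_MAX):
    """Option 1 (sketch): choose the offset width from a known value count."""
    return "large_list" if total_values > max_offset else "list"


def plan_chunks(lengths, max_offset=INT32_MAX):
    """Option 3 (sketch): split rows into (start, stop) ranges that each fit.

    Every returned range covers rows whose combined list length stays within
    max_offset, so each chunk could keep 32-bit offsets.
    """
    chunks = []
    start = 0      # first row of the chunk being built
    running = 0    # number of values accumulated in the current chunk
    for row, length in enumerate(lengths):
        if length > max_offset:
            raise ValueError(f"row {row} alone exceeds the offset limit")
        if running + length > max_offset:
            # Close the current chunk before this row and start a new one.
            chunks.append((start, row))
            start, running = row, 0
        running += length
    if start < len(lengths):
        chunks.append((start, len(lengths)))
    return chunks


# Small limit so the split is visible: five rows of four values each.
print(pick_list_type(3_000_000_000))
print(plan_chunks([4, 4, 4, 4, 4], max_offset=10))
```

The chunk planner is greedy and only decides where to cut; the hard part noted in option 3, teaching the read path to emit several arrays, is not modelled here.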

I seem to recall someone tried prototyping option 2 recently, but I'm having 
trouble finding the thread/JIRA at the moment.

> [Parquet][C++][Python] "List index overflow" when read parquet file
> ---
>
> Key: ARROW-17983
> URL: https://issues.apache.org/jira/browse/ARROW-17983
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Reporter: Yibo Cai
>Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < 
> max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on how to reproduce this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Testing with a small dataset suggests the error might come from the code below.
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is 
> incremented) {{m * n}} times which is beyond {{max(int32)}}.
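
The arithmetic behind the quoted reproduction steps can be checked without writing a large file. The sketch below is plain Python; the function name and the example sizes are hypothetical and chosen only for illustration, not taken from the report.

```python
INT32_MAX = 2**31 - 1  # max(int32) in the issue description


def hits_list_index_overflow(n_rows, list_len):
    """True when the row count fits in int32 but the flattened values do not."""
    return n_rows < INT32_MAX and n_rows * list_len > INT32_MAX


# Hypothetical sizes: n = 3,000,000 rows of m = 1,000 integers each.
print(hits_list_index_overflow(3_000_000, 1_000))
# Same row count with shorter lists stays under the limit.
print(hits_list_index_overflow(3_000_000, 100))
```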



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

2022-10-17 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618804#comment-17618804
 ] 

Yibo Cai commented on ARROW-17983:
--

cc [~emkornfi...@gmail.com] for comments.
