[ 
https://issues.apache.org/jira/browse/ARROW-18198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633831#comment-17633831
 ] 

David Dali Susanibar Arce commented on ARROW-18198:
---------------------------------------------------

There is no problem for binary files with rowCount <= 2048.

There is a problem with the validity buffer: for example, with 2049 rows, a 
buffer size of 504 is initially allocated, but a length of 512 is requested 
at the end.

I need to continue reviewing to determine what changes are needed.
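
For reference while debugging, here is a minimal diagnostic sketch (not from 
the issue; it assumes the example.arrow file produced by the pandas snippet 
quoted below) that prints each vector's value count and buffer capacities 
after a batch is loaded. Since the failing batch throws inside loadNextBatch 
itself, this only reports batches that load successfully, but it is one way 
to compare the allocated capacities against the sizes that setValueCount 
later requests:

{code:java}
// Hypothetical diagnostic sketch: compare allocated buffer capacities with
// the value counts that each loaded batch carries.
// Assumes "example.arrow" was written by the pandas reproduction below.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class BufferCapacityCheck {
  public static void main(String[] args) throws IOException {
    File file = new File("example.arrow");
    try (BufferAllocator allocator = new RootAllocator();
         FileInputStream in = new FileInputStream(file);
         ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
      while (reader.loadNextBatch()) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        for (FieldVector vector : root.getFieldVectors()) {
          // Capacities are in bytes; the validity buffer holds one bit per row.
          System.out.printf("%s: valueCount=%d validity=%d bytes data=%d bytes%n",
              vector.getName(), vector.getValueCount(),
              vector.getValidityBuffer().capacity(),
              vector.getDataBuffer().capacity());
        }
      }
    }
  }
}
{code}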

 

> IndexOutOfBoundsException when loading compressed IPC format
> ------------------------------------------------------------
>
>                 Key: ARROW-18198
>                 URL: https://issues.apache.org/jira/browse/ARROW-18198
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 4.0.1, 9.0.0, 10.0.0
>         Environment: Linux and Windows.
> Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
> Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
>            Reporter: Georeth Zhou
>            Priority: Major
>
> I encountered this bug when I loaded a dataframe stored in the Arrow IPC 
> format.
>  
> {code:java}
> // Java code from the "Apache Arrow Java Cookbook"
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowFileReader;
> import org.apache.arrow.vector.ipc.message.ArrowBlock;
>
> File file = new File("example.arrow");
> try (
>         BufferAllocator rootAllocator = new RootAllocator();
>         FileInputStream fileInputStream = new FileInputStream(file);
>         ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
> ) {
>     System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
>     for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
>         reader.loadRecordBatch(arrowBlock);
>         VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
>         System.out.print(vectorSchemaRootRecover.contentToTSVString());
>     }
> } catch (IOException e) {
>     e.printStackTrace();
> }
> {code}
> Call stack:
> {noformat}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
>     at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
>     at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
>     at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
>     at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
>     at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197){noformat}
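> The trace shows that, while loading the batch, setValueCount triggers a reAlloc that tries to copy 2048 bytes into an ArrowBuf with only 2024 bytes of capacity, which fails the bounds check.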
> This bug can be reproduced with a simple dataframe created by pandas:
>  
> {code:python}
> import pandas as pd
> pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
> {code}
> Pandas compresses the dataframe by default. If the compression is turned off 
> (e.g. by passing compression='uncompressed' to to_feather), Java can load the 
> dataframe. Thus, I guess the bounds-checking code is buggy when loading 
> compressed files.
>  
> That dataframe can be loaded by polars, pandas and pyarrow, so it's unlikely 
> to be a pandas bug.


