[ https://issues.apache.org/jira/browse/ARROW-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chao Sun resolved ARROW-9790.
-----------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8007
[https://github.com/apache/arrow/pull/8007]

> [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9790
>                 URL: https://issues.apache.org/jira/browse/ARROW-9790
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Andrew Lamb
>            Assignee: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: parquet_file_arrow_reader.zip
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> While reading a parquet file into RecordBatches using {{ParquetFileArrowReader}}, with row groups of 100,000 rows and a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:
> {code}
> ParquetError("Parquet error: Not all children array length are the same!")
> {code}
> Upon investigation, I found that when reading with {{ParquetFileArrowReader}}, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group for Int or Float, no subsequent row groups are read.
> Visually:
> {code}
> +-----+
> | RG1 |
> |     |
> +-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
> +-----+
> | RG2 |
> |     |
> +-----+
> {code}
> A reproducer is attached. 20 values should be read by the {{ParquetFileArrowReader}} regardless of the batch size. However, with batch sizes such as {{5}} or {{3}} (which fall on a boundary between row groups), not all the rows are read.
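The failure mode described above can be sketched in self-contained Rust, using a hypothetical `Reader` type that is NOT the parquet crate's actual API: the buggy variant only advances to the next row group while a batch still needs rows, so a batch that fills up exactly at a row group boundary leaves the reader pointing at an empty group, which the next call misreads as end of data.

```rust
// Sketch of the ARROW-9790 failure mode with a hypothetical `Reader` type
// (not the parquet crate's API). Each row group holds some number of rows;
// `next_batch_*` pulls up to `batch_size` rows, crossing group boundaries.
struct Reader {
    row_groups: Vec<usize>, // rows remaining in each row group
    current: usize,         // index of the row group currently being read
}

impl Reader {
    /// Buggy variant: advances to the next row group only mid-batch, so a
    /// batch that fills exactly at a row group boundary strands `current`
    /// on an empty group, and the next call reports end of data.
    fn next_batch_buggy(&mut self, batch_size: usize) -> usize {
        let mut read = 0;
        while read < batch_size && self.current < self.row_groups.len() {
            if self.row_groups[self.current] == 0 {
                break; // bug: exhausted current group treated as end of file
            }
            let take = (batch_size - read).min(self.row_groups[self.current]);
            self.row_groups[self.current] -= take;
            read += take;
            if self.row_groups[self.current] == 0 && read < batch_size {
                self.current += 1; // only advances while the batch is unfilled
            }
        }
        read
    }

    /// Fixed variant: an exhausted row group simply means "move on".
    fn next_batch_fixed(&mut self, batch_size: usize) -> usize {
        let mut read = 0;
        while read < batch_size && self.current < self.row_groups.len() {
            if self.row_groups[self.current] == 0 {
                self.current += 1;
                continue;
            }
            let take = (batch_size - read).min(self.row_groups[self.current]);
            self.row_groups[self.current] -= take;
            read += take;
        }
        read
    }
}

/// Drain the reader and count the total rows it yields.
fn total_rows(mut reader: Reader, batch_size: usize, buggy: bool) -> usize {
    let mut total = 0;
    loop {
        let n = if buggy {
            reader.next_batch_buggy(batch_size)
        } else {
            reader.next_batch_fixed(batch_size)
        };
        if n == 0 {
            return total;
        }
        total += n;
    }
}

fn main() {
    // 20 rows in 4 row groups of 5 each, as in the attached reproducer.
    let mk = || Reader { row_groups: vec![5; 4], current: 0 };
    println!("buggy, batch_size 7: {}", total_rows(mk(), 7, true)); // 20
    println!("buggy, batch_size 5: {}", total_rows(mk(), 5, true)); // only 5
    println!("fixed, batch_size 5: {}", total_rows(mk(), 5, false)); // 20
}
```

With batch size 7 the boundary is always crossed mid-batch, so even the buggy loop reads all 20 rows; batch size 5 lands exactly on every 5-row group boundary and the buggy loop stops after the first group, matching the reproducer output below.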
> To run the reproducer, decompress the attachment [^parquet_file_arrow_reader.zip] and run `cargo run`.
> The output is as follows:
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 5
> {code}
> The expected output is as follows (20 rows should always be read, regardless of the batch size):
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 20
> {code}
> h2. Workaround
> Use a different batch size that does not fall on row group boundaries.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)