[ https://issues.apache.org/jira/browse/ARROW-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chao Sun resolved ARROW-9790.
-----------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8007
[https://github.com/apache/arrow/pull/8007]

> [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9790
>                 URL: https://issues.apache.org/jira/browse/ARROW-9790
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Andrew Lamb
>            Assignee: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: parquet_file_arrow_reader.zip
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> While reading a parquet file into RecordBatches using {{ParquetFileArrowReader}}, with row groups of 100,000 rows and a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:
> {code}
> ParquetError("Parquet error: Not all children array length are the same!")
> {code}
> Upon investigation, I found that when reading with {{ParquetFileArrowReader}}, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group for Int or Float, no subsequent row groups are read.
> Visually:
> {code}
> +-----+
> | RG1 |
> |     |
> +-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
> +-----+
> | RG2 |
> |     |
> +-----+
> {code}
> A reproducer is attached. 20 values should be read by the {{ParquetFileArrowReader}} regardless of the batch size. However, with batch sizes such as {{5}} or {{3}} (which fall on a boundary between row groups), not all the rows are read.
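The failure mode described above can be sketched in self-contained Rust, using a hypothetical `Reader` type that is NOT the parquet crate's actual API: the buggy variant only advances to the next row group while a batch still needs rows, so a batch that fills up exactly at a row group boundary leaves the reader pointing at an empty group, which the next call misreads as end of data.

```rust
// Sketch of the ARROW-9790 failure mode with a hypothetical `Reader` type
// (not the parquet crate's API). Each row group holds some number of rows;
// `next_batch_*` pulls up to `batch_size` rows, crossing group boundaries.
struct Reader {
    row_groups: Vec<usize>, // rows remaining in each row group
    current: usize,         // index of the row group currently being read
}

impl Reader {
    /// Buggy variant: advances to the next row group only mid-batch, so a
    /// batch that fills exactly at a row group boundary strands `current`
    /// on an empty group, and the next call reports end of data.
    fn next_batch_buggy(&mut self, batch_size: usize) -> usize {
        let mut read = 0;
        while read < batch_size && self.current < self.row_groups.len() {
            if self.row_groups[self.current] == 0 {
                break; // bug: exhausted current group treated as end of file
            }
            let take = (batch_size - read).min(self.row_groups[self.current]);
            self.row_groups[self.current] -= take;
            read += take;
            if self.row_groups[self.current] == 0 && read < batch_size {
                self.current += 1; // only advances while the batch is unfilled
            }
        }
        read
    }

    /// Fixed variant: an exhausted row group simply means "move on".
    fn next_batch_fixed(&mut self, batch_size: usize) -> usize {
        let mut read = 0;
        while read < batch_size && self.current < self.row_groups.len() {
            if self.row_groups[self.current] == 0 {
                self.current += 1;
                continue;
            }
            let take = (batch_size - read).min(self.row_groups[self.current]);
            self.row_groups[self.current] -= take;
            read += take;
        }
        read
    }
}

/// Drain the reader and count the total rows it yields.
fn total_rows(mut reader: Reader, batch_size: usize, buggy: bool) -> usize {
    let mut total = 0;
    loop {
        let n = if buggy {
            reader.next_batch_buggy(batch_size)
        } else {
            reader.next_batch_fixed(batch_size)
        };
        if n == 0 {
            return total;
        }
        total += n;
    }
}

fn main() {
    // 20 rows in 4 row groups of 5 each, as in the attached reproducer.
    let mk = || Reader { row_groups: vec![5; 4], current: 0 };
    println!("buggy, batch_size 7: {}", total_rows(mk(), 7, true)); // 20
    println!("buggy, batch_size 5: {}", total_rows(mk(), 5, true)); // only 5
    println!("fixed, batch_size 5: {}", total_rows(mk(), 5, false)); // 20
}
```

With batch size 7 the boundary is always crossed mid-batch, so even the buggy loop reads all 20 rows; batch size 5 lands exactly on every 5-row group boundary and the buggy loop stops after the first group, matching the reproducer output below.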
> To run the reproducer, decompress the attachment [^parquet_file_arrow_reader.zip] and run `cargo run`.
> The output is as follows:
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 5
> {code}
> The expected output is as follows (20 rows should always be read, regardless of the batch size):
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 20
> {code}
> h2. Workaround
> Use a different batch size that does not fall on row group boundaries.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)