jonded94 opened a new pull request, #9374:
URL: https://github.com/apache/arrow-rs/pull/9374
# Which issue does this PR close?
- Closes #9370.
# Rationale for this change
The bug occurs when using RowSelection with nested types (such as
List<String>) under the following conditions:
1. A column has multiple pages in a row group
2. The selected rows span across page boundaries
3. The first page is entirely consumed during skip operations
The issue was in .../arrow-rs/parquet/src/column/reader.rs:287-382
(skip_records function).
Root cause: When skip_records completed successfully after crossing page
boundaries, the has_partial state in the RepetitionLevelDecoder could
incorrectly remain true. This happened when:
- The skip operation exhausted a page where has_record_delimiter was false
- The skip found the remaining records on the next page by counting a
  delimiter at index 0
On a subsequent read_records(1) call, the stale has_partial=true state caused
count_records to misinterpret the first repetition level (0) at index 0 as
the end of a "phantom" partial record. It returned (1 record, 0 levels,
0 values) instead of reading the actual record data.
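To make the phantom record concrete, here is a minimal toy model of the counting rule described above. It is a sketch, not the arrow-rs implementation: ToyRepLevels and its count_records signature are hypothetical names invented for illustration, and the only behavior taken from this description is that a repetition level of 0 at index 0 closes a record only while has_partial is set.

```rust
/// Toy stand-in for the repetition-level record counter.
/// Hypothetical type for illustration; not the arrow-rs decoder.
struct ToyRepLevels {
    /// True while a record started by an earlier call is still open.
    has_partial: bool,
}

impl ToyRepLevels {
    /// Counts up to `records_to_read` records in `levels`.
    /// Returns (records_read, levels_consumed).
    fn count_records(&self, levels: &[i16], records_to_read: usize) -> (usize, usize) {
        let mut records_read = 0;
        for (idx, &level) in levels.iter().enumerate() {
            // A level of 0 starts a new record and therefore ends the previous
            // one. At index 0 there is only a previous record to end if an
            // earlier call left one open.
            if level == 0 && (idx != 0 || self.has_partial) {
                records_read += 1;
                if records_read == records_to_read {
                    return (records_read, idx);
                }
            }
        }
        // Ran out of levels before reaching the requested record count.
        (records_read, levels.len())
    }
}
```

With the levels [0, 1, 1, 0] (one three-element list followed by the start of the next record), a counter with has_partial = false reports (1, 3) for a one-record read, while a counter with a stale has_partial = true reports (1, 0): one record and zero levels, matching the symptom described above.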
For a more detailed explanation, see:
https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861143928
# What changes are included in this PR?
Added code at the end of skip_records to reset the partial record state when
all requested records have been successfully skipped.
This ensures that after skip_records completes, we're at a clean record
boundary with no lingering partial record state, fixing the array length
mismatch in StructArrayReader.
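The shape of the reset can be sketched in isolation. This is a simplified illustration under stated assumptions, not the code in this PR: ToyDecoderState and finish_skip are hypothetical names, and the real change lives at the end of skip_records.

```rust
/// Toy stand-in for the decoder's partial-record flag.
/// Hypothetical type for illustration; not the arrow-rs decoder.
#[derive(Debug, Default)]
struct ToyDecoderState {
    /// True while a record started earlier has not been closed yet.
    has_partial: bool,
}

/// Hypothetical post-skip step: once every requested record has been skipped,
/// the reader sits on a record boundary, so no partial record can be pending.
fn finish_skip(state: &mut ToyDecoderState, requested: usize, skipped: usize) {
    if skipped == requested {
        state.has_partial = false;
    }
}

/// Runs `finish_skip` on a state whose flag was left set and reports whether
/// the flag is still set afterwards.
fn stale_flag_after(requested: usize, skipped: usize) -> bool {
    let mut state = ToyDecoderState { has_partial: true };
    finish_skip(&mut state, requested, skipped);
    state.has_partial
}
```

In this sketch the flag is cleared only when the skip fully completes (stale_flag_after(3, 3) is false); a skip that falls short leaves the state untouched (stale_flag_after(3, 2) is true).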
# Are these changes tested?
Unfortunately, I was unable to come up with a test that demonstrates the fix,
since reproducing the issue requires a very particular arrangement of data
pages and row skips. I was only able to observe the issue (and confirm that
this PR fixes it) with a very specific Parquet file that I'm not allowed to
share:
https://github.com/apache/arrow-rs/issues/9370#issue-3907106841