jonded94 commented on issue #9370: URL: https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861143928
I'm sadly not too well versed in the internals of `arrow-rs`, but let me
paste what Claude Code vibed for me (sorry if you don't like this spam 😢).
This is the debug summary:
```
Root Cause

The bug occurs when using RowSelection with nested types (like
List<String>) when:
1. A column has multiple pages in a row group
2. The selected rows span across page boundaries
3. The first page is entirely consumed during skip operations

Specific Scenario

In the test case:
- Row group 99 has 5120 rows
- Column l1.element (first list column) has only 13711 values in its first
  page, while other element columns have 224796 values per page
- Selected rows are [2, 955]

The sequence that triggers the bug:
1. Skip 2 rows: First column consumes 60 of 13711 values
2. Read 1 row (row 2): Returns 6 list elements correctly
3. Skip 952 rows: First column consumes ALL remaining 13651 values (13711
   total)
4. Read 1 row (row 955):
   - has_next() reads a new page (12834 values)
   - BUT GenericColumnReader::read_records returns (1, 0, 0) - 1 record
     with 0 levels!

Why This Happens

The issue is in how the repetition level decoder's flush_partial()
interacts with page boundaries:

  has_next [element] -> needs new page, buffered=13711, decoded=13711
  has_next [element] -> new page read, buffered=12834
  GenericColumnReader::read_records EXIT [element] -> (1, 0, 0)

After reading a new page, the read_rep_levels function returns 0 levels
but the flush_partial() mechanism adds 1 record. This creates a situation where
1 record is "counted" without any actual data being read from the new page.

Result

- First list column (l1): Returns 6 values with 1 record boundary
  (rep_level=0)
- Other list columns: Return 273 values with 2 record boundaries
- This mismatch causes the "Not all children array length are the same!"
  error at struct_array.rs:127

Affected Code Locations

1. /parquet/src/column/reader.rs - GenericColumnReader::read_records and
   skip_records
2. /parquet/src/column/reader/decoder.rs -
   RepetitionLevelDecoderImpl::flush_partial

The bug is specifically in how has_partial state is managed across page
boundaries when using row selection. When a page is exhausted during skip and a
new page is read for subsequent read operations, the record counting logic can
incorrectly "complete" a record without actually reading any data.
```
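To make the suspected mechanism easier to reason about, here is a toy model of record counting over repetition levels. It is only a sketch and not the `arrow-rs` code: `ToyRepLevelCounter`, `read_after_page_boundary` and the `credit_flush_to_read` switch are hypothetical names invented for illustration, and the model assumes that a skip which exhausts a page leaves `has_partial` set. It only shows how a pending partial record, if credited to the following read request, would yield "1 record, 0 levels" as in the trace above.

```rust
/// Toy model, NOT the arrow-rs implementation.
/// A repetition level of 0 starts a new record; any other level continues one.
#[derive(Default)]
struct ToyRepLevelCounter {
    /// True when levels of a not-yet-closed record have already been consumed.
    has_partial: bool,
}

impl ToyRepLevelCounter {
    /// Consume levels from `page` until `max_records` records are complete
    /// or the page runs out. Returns (records_completed, levels_consumed).
    fn read(&mut self, page: &[i16], max_records: usize) -> (usize, usize) {
        let mut records = 0;
        let mut consumed = 0;
        for &level in page {
            if level == 0 && self.has_partial {
                // A new record starts here, so the pending one is complete.
                records += 1;
                self.has_partial = false;
                if records == max_records {
                    // Leave this level unconsumed for the next request.
                    break;
                }
            }
            // This level belongs to a record that is still open.
            self.has_partial = true;
            consumed += 1;
        }
        (records, consumed)
    }

    /// Close a pending record at a page boundary.
    /// Returns true if a record was pending.
    fn flush_partial(&mut self) -> bool {
        std::mem::take(&mut self.has_partial)
    }
}

/// Hypothetical helper: serve a "read 1 record" request right after a new
/// page was loaded. With `credit_flush_to_read = true` the flushed record
/// satisfies the request by itself (the suspected miscount); with `false`
/// the flushed record is treated as part of the earlier skip and the read
/// decodes levels from the new page.
fn read_after_page_boundary(
    counter: &mut ToyRepLevelCounter,
    next_page: &[i16],
    credit_flush_to_read: bool,
) -> (usize, usize) {
    let flushed = counter.flush_partial();
    if flushed && credit_flush_to_read {
        return (1, 0);
    }
    counter.read(next_page, 1)
}
```

With a first page of levels `[0, 1, 1, 0, 1]` and a second page `[0, 1, 1, 1, 0]`, skipping through the first page leaves the second record open. Crediting the flush to the next read gives `(1, 0)`, while attributing it to the skip gives `(1, 4)`, that is, one record with four levels taken from the new page.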
Debug output:
[claude-debug-output.txt](https://github.com/user-attachments/files/25131822/claude-debug-output.txt)
Patch for enabling all the debug output:
[claude-debug-statements.patch](https://github.com/user-attachments/files/25131772/claude-debug-statements.patch)
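
As a cross-check on the skip/read sequence quoted in the summary, the steps follow directly from the selected row indices. The sketch below uses plain std types instead of the parquet crate's `RowSelector`; `Step` and `steps_for_rows` are hypothetical names, and the input is assumed to be strictly increasing, zero-based row indices.

```rust
/// One step of a row selection: skip `n` rows or read `n` rows.
#[derive(Debug, PartialEq)]
enum Step {
    Skip(usize),
    Read(usize),
}

/// Turn strictly increasing, zero-based row indices into skip/read steps.
/// No trailing skip is produced because the total row count is not known here.
fn steps_for_rows(sorted_rows: &[usize]) -> Vec<Step> {
    let mut steps = Vec::new();
    let mut cursor = 0; // first row not yet covered by any step
    for &row in sorted_rows {
        if row > cursor {
            steps.push(Step::Skip(row - cursor));
        }
        // Extend the previous read when rows are adjacent, else start a new one.
        if let Some(Step::Read(n)) = steps.last_mut() {
            *n += 1;
        } else {
            steps.push(Step::Read(1));
        }
        cursor = row + 1;
    }
    steps
}
```

For the selected rows `[2, 955]` this gives skip 2, read 1, skip 952, read 1, which matches the sequence in the summary.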
