jonded94 opened a new pull request, #9374:
URL: https://github.com/apache/arrow-rs/pull/9374

   # Which issue does this PR close?
   
   - Closes #9370.
   
   # Rationale for this change
   
   The bug occurs when using RowSelection with nested types (like List<String>) 
when:

     1. A column has multiple pages in a row group
     2. The selected rows span across page boundaries
     3. The first page is entirely consumed during skip operations
   
   The issue was in .../arrow-rs/parquet/src/column/reader.rs:287-382 
(skip_records function).
   
   Root cause: When skip_records completed successfully after crossing page 
boundaries, the has_partial state in the RepetitionLevelDecoder could 
incorrectly remain true. This happened when:
   
   - The skip operation exhausted a page where has_record_delimiter was false
   - The skip found the remaining records on the next page by counting a 
delimiter at index 0
   
   When a subsequent read_records(1) was called, the stale has_partial=true 
state caused count_records to incorrectly interpret the first repetition level 
(0) at index 0 as ending a "phantom" partial record, returning (1 record, 0 
levels, 0 values) instead of properly reading the actual record data.
   
   For a more descriptive explanation, look here: 
https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861143928
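
The phantom record can be illustrated with a small toy model. This is only a sketch of the counting rule as described above, not the arrow-rs implementation: the free function `count_records`, its parameters, and the sample levels are all made up for illustration.

```rust
/// Toy model of record counting over repetition levels.
/// Hypothetical helper, NOT the arrow-rs `count_records` signature.
/// Returns (records counted, levels consumed).
fn count_records(levels: &[i16], has_partial: bool, max_records: usize) -> (usize, usize) {
    let mut records = 0;
    for (idx, &level) in levels.iter().enumerate() {
        // A repetition level of 0 starts a new record. It also ends a previous
        // record, but only if one exists: either earlier levels in this slice
        // (idx != 0) or a partial record carried over (has_partial).
        if level == 0 && (idx != 0 || has_partial) {
            records += 1;
            if records == max_records {
                return (records, idx);
            }
        }
    }
    (records, levels.len())
}

fn main() {
    // One record with three list elements, followed by the start of the next record.
    let levels = [0, 1, 1, 0];

    // Stale state: the 0 at index 0 is taken as the end of a partial record,
    // so one record is reported with zero levels consumed.
    println!("stale has_partial: {:?}", count_records(&levels, true, 1)); // (1, 0)

    // Clean state: the record spans the first three levels.
    println!("clean has_partial: {:?}", count_records(&levels, false, 1)); // (1, 3)
}
```

In this model the stale flag yields a record with no levels behind it, which corresponds to the (1 record, 0 levels, 0 values) result described above.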
   
   # What changes are included in this PR?
   
   Added code at the end of skip_records to reset the partial record state when 
all requested records have been successfully skipped.
   
   This ensures that after skip_records completes, we're at a clean record 
boundary with no lingering partial record state, fixing the array length 
mismatch in StructArrayReader.
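
As a rough sketch of the idea, assuming a heavily simplified reader: the toy below reproduces a skip that crosses a page boundary and shows the effect of resetting the flag once all requested records are skipped. `ToyReader`, its fields, and the `reset_on_success` switch are hypothetical and exist only for this illustration; the real state handling in arrow-rs is more involved.

```rust
// Toy model only: type, field, and method names are hypothetical, not the arrow-rs API.
struct ToyReader {
    pages: Vec<Vec<i16>>, // repetition levels, one Vec per data page
    page: usize,          // index of the current page
    offset: usize,        // position within the current page
    has_partial: bool,    // true while a record is believed to be in progress
}

impl ToyReader {
    /// Counts up to `max` record boundaries in the rest of the current page.
    /// Returns (records counted, levels consumed).
    fn count_records(&self, max: usize) -> (usize, usize) {
        let levels = &self.pages[self.page][self.offset..];
        let mut records = 0;
        for (idx, &level) in levels.iter().enumerate() {
            if level == 0 && (idx != 0 || self.has_partial) {
                records += 1;
                if records == max {
                    return (records, idx);
                }
            }
        }
        (records, levels.len())
    }

    /// Skips `n` records; `reset_on_success` toggles the reset described above.
    fn skip_records(&mut self, n: usize, reset_on_success: bool) -> usize {
        let mut skipped = 0;
        while skipped < n && self.page < self.pages.len() {
            let (records, levels) = self.count_records(n - skipped);
            skipped += records;
            self.offset += levels;
            if self.offset == self.pages[self.page].len() {
                // Page exhausted without a closing delimiter: the record may continue.
                self.has_partial = true;
                self.page += 1;
                self.offset = 0;
            }
        }
        if reset_on_success && skipped == n {
            // Every requested record was skipped, so we sit on a clean boundary.
            self.has_partial = false;
        }
        skipped
    }
}

fn main() {
    for reset in [false, true] {
        let mut reader = ToyReader {
            pages: vec![vec![0, 1, 1], vec![0, 1, 0, 1]],
            page: 0,
            offset: 0,
            has_partial: false,
        };
        let skipped = reader.skip_records(1, reset);
        // Stand-in for a subsequent read_records(1).
        let next = reader.count_records(1);
        println!("reset={}: skipped={}, next read={:?}", reset, skipped, next);
    }
}
```

Without the reset, the read after the skip reports (1, 0), a record with no levels. With the reset it reports (1, 2), the two levels that make up the next record.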
   
   # Are these changes tested?
   
   Unfortunately, I was unable to come up with a test that demonstrates the 
fix, since it requires a very particular combination of data page layout and 
row skips. I was only able to reproduce the issue (and confirm that this PR 
fixes it) with a very specific Parquet file that I am not allowed to share: 
https://github.com/apache/arrow-rs/issues/9370#issue-3907106841

