jonded94 commented on issue #9370:
URL: https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861143928

   Sadly, I'm not well versed in the internals of `arrow-rs`, but let me
paste what Claude Code vibed for me (sorry if you don't like this spam 😢).
   
   This is the debug summary:
   
   ```
     Root Cause

     The bug occurs when using RowSelection with nested types (like List<String>) when:

     1. A column has multiple pages in a row group
     2. The selected rows span across page boundaries
     3. The first page is entirely consumed during skip operations
                                             
     Specific Scenario

     In the test case:

     - Row group 99 has 5120 rows
     - Column l1.element (first list column) has only 13711 values in its first page, while other element columns have 224796 values per page
     - Selected rows are [2, 955]

     The sequence that triggers the bug:

     1. Skip 2 rows: First column consumes 60 of 13711 values
     2. Read 1 row (row 2): Returns 6 list elements correctly
     3. Skip 952 rows: First column consumes ALL remaining 13651 values (13711 total)
     4. Read 1 row (row 955):
       - has_next() reads a new page (12834 values)
       - BUT GenericColumnReader::read_records returns (1, 0, 0) - 1 record with 0 levels!
                                             
     Why This Happens

     The issue is in how the repetition level decoder's flush_partial() interacts with page boundaries:

     has_next [element] -> needs new page, buffered=13711, decoded=13711
     has_next [element] -> new page read, buffered=12834
     GenericColumnReader::read_records EXIT [element] -> (1, 0, 0)

     After reading a new page, the read_rep_levels function returns 0 levels but the flush_partial() mechanism adds 1 record. This creates a situation where 1 record is "counted" without any actual data being read from the new page.
                                             
     Result

     - First list column (l1): Returns 6 values with 1 record boundary (rep_level=0)
     - Other list columns: Return 273 values with 2 record boundaries
     - This mismatch causes the "Not all children array length are the same!" error at struct_array.rs:127
                                             
     Affected Code Locations

     1. /parquet/src/column/reader.rs - GenericColumnReader::read_records and skip_records
     2. /parquet/src/column/reader/decoder.rs - RepetitionLevelDecoderImpl::flush_partial

     The bug is specifically in how has_partial state is managed across page boundaries when using row selection. When a page is exhausted during skip and a new page is read for subsequent read operations, the record counting logic can incorrectly "complete" a record without actually reading any data.
   ```
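
For anyone trying to follow the skip/read sequence in the summary, here is a minimal, std-only sketch of how the selected rows `[2, 955]` turn into the four steps listed above. The `Run` enum and `runs_from_rows` are hypothetical names made up for illustration, not `arrow-rs` code; in the real crate the equivalent selectors would presumably be built with `RowSelector::skip` / `RowSelector::select` and wrapped in a `RowSelection`. The value counts are copied from the summary, not measured independently.

```rust
/// A run of consecutive rows that is either skipped or read.
/// Hypothetical stand-in for illustration; not a type from the `parquet` crate.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Run {
    Skip(usize),
    Read(usize),
}

/// Convert sorted, de-duplicated row indices into alternating skip/read runs.
fn runs_from_rows(rows: &[usize]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next_row = 0; // first row not yet covered by any run
    for &row in rows {
        if row > next_row {
            // Gap between the previous selected row and this one.
            runs.push(Run::Skip(row - next_row));
        }
        // Adjacent selected rows extend the previous read run.
        let adjacent = row == next_row && matches!(runs.last(), Some(Run::Read(_)));
        if adjacent {
            if let Some(Run::Read(n)) = runs.last_mut() {
                *n += 1;
            }
        } else {
            runs.push(Run::Read(1));
        }
        next_row = row + 1;
    }
    runs
}

fn main() {
    // Selected rows from the summary above.
    let runs = runs_from_rows(&[2, 955]);
    println!("{:?}", runs); // [Skip(2), Read(1), Skip(952), Read(1)]

    // Value counts quoted in the summary for the first page of l1.element:
    // 13711 values in the page, 60 consumed by the first skip.
    let remaining_after_first_skip = 13711 - 60;
    println!("{}", remaining_after_first_skip); // 13651
}
```

The point of the sketch is only that the second skip (952 rows) is what drains the rest of the first page, so the final one-row read is the first operation to touch the second page.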
   
   Debug output:
   
   
[claude-debug-output.txt](https://github.com/user-attachments/files/25131822/claude-debug-output.txt)
   
   Patch for enabling all the debug output:
   
   
[claude-debug-statements.patch](https://github.com/user-attachments/files/25131772/claude-debug-statements.patch)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.