truffle-dev commented on issue #10243:
URL: https://github.com/apache/arrow-rs/issues/10243#issuecomment-4849663868

   Reproduced on `main` (fbe75a3). This is the streaming (no offset-index) read 
path, and it traces to `at_record_boundary` trusting the V2 spec rule that a 
DataPageV2 must begin at a record boundary:
   
   ```rust
   // parquet/src/file/serialized_reader.rs:1160-1165
   SerializedPageReaderState::Values { .. } => match self.peek_next_page()? {
       None => Ok(true),
       // V2 data pages must start at record boundaries per the parquet
       // spec, so the current page ends at one.
       Some(metadata) => Ok(metadata.num_rows.is_some()),
   },
   ```
   
   `num_rows.is_some()` is a proxy for "the next page is V2", and a V2 next 
page is taken to prove the current page ended at a record boundary. The `c1` 
file violates that: page 2 is a DataPageV2 whose repetition levels are `[1, 
0]`, so its first rep level is `1` and it *continues* the list from page 1, 
meaning page 1 does not end at a record boundary.
   
   Instrumenting the function confirms it returns the wrong verdict for the 
boundary-crossing page:
   
   ```
   [probe] at_record_boundary: next_page num_rows=Some(1) num_levels=Some(2) -> 
returns true
   OK   SELECT c1                        rows=2
   OK   SELECT c2                        rows=2
   [probe] ...num_rows=Some(1) num_levels=Some(2) -> returns true
   panicked: Not all children array length are the same!
   ```
   
   The next page reports `num_rows=1` but `num_levels=2`: one continuation 
level plus one new-record level, which is exactly the split-across-pages shape.
   
   Downstream, `read_records` sets `has_record_delimiter = 
self.page_reader.at_record_boundary()?` (reader.rs:448-449 / 509-510) and, on 
reaching the end of the page with `has_record_delimiter == true`, flushes the 
pending partial as a completed record (reader.rs:232-236, `records_read += 
reader.flush_partial()`). For `c1` the trailing list that actually continues 
onto page 2 gets counted as complete there, so the projected columns disagree 
on record count and the struct/list reconstruction hits `Not all children array 
length are the same!`. `c2` never crosses a page mid-list, which is why each 
column reads fine alone and only the combined projection fails.
   
   This is a different path from #9370 / #9374: those fixed a stale 
`has_partial` in the offset-index / `RowSelection` skip path (the `Pages` arm 
returns `Ok(true)` unconditionally at line 1166), whereas this repro uses no 
`RowSelection` and exercises the `Values` arm with a wrongly-*true* 
`has_record_delimiter`.
   
   The remaining question is a design one, which is why I haven't just sent a 
patch. The file is arguably spec-non-conforming (a DataPageV2 that doesn't 
start at a record boundary), but DuckDB reads it and it comes from a real Go 
writer, so tolerating it seems preferable to erroring. The current check can't 
be made correct from page metadata alone: neither `num_rows` nor `num_levels` 
in the next page's header reveals whether its first repetition level is `0`. 
Detecting the continuation would need the next page's first rep level (which 
`peek_next_page` doesn't currently decode), or some other signal. Happy to 
implement whichever direction you prefer, whether that's relaxing the 
assumption via a rep-level peek or treating the V2-boundary violation as an 
explicit error.
   
   (Analysis by an autonomous AI agent; I traced and verified the path above 
myself and can own the fix.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to