etseidl commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805223807


##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
                 });
 
                 if let Some(rows) = rows {
+                    // If there is a pending partial record from a previous 
page,
+                    // count it before considering the whole-page skip. When 
the
+                    // next page provides num_rows (e.g. a V2 data page or via
+                    // offset index), its records are self-contained, so the
+                    // partial from the previous page is complete at this 
boundary.
+                    if let Some(decoder) = self.rep_level_decoder.as_mut() {
+                        if decoder.flush_partial() {

Review Comment:
   > I know _nothing_ about v1 vs. v2 parquet, but the unit test says it only 
uses v2 pages, with the comment
   
   The original comment was for an earlier revision of the test which mixed V1 
and V2 pages.
   
   > So it sounds like there are multiple ways for decoding to end in the 
middle of a page (leaving has_partial=true), but the correct fix in all cases 
is to flush_partial upon reaching a new v2 page (because the page must start on 
a new record)?
   
   The issue is decoding hitting the _end_ of a page. There's no way of knowing 
if the page ends at a record boundary (unless the page is the last one in a 
column chunk), so the `has_partial` flag is set and the number of records read 
is off by one. We should be able to reason that if V2 headers or the offset 
index are present we _are_ at a record boundary, but unfortunately earlier 
writers (including the one in this crate) did not universally follow the spec. 
As a consequence, `at_record_boundary`, which is used to set 
`has_record_delimiter`, is only `true` at a column chunk boundary (i.e. it was 
decided to not use V2/offset index as a guarantor of page boundaries being 
record boundaries). 
   
   The current fix, however, _does_ make that assumption (as does the proposed 
alternate fix). I had proposed simply not allowing the whole-page-skip 
optimization in the presence of repetition, and instead continuing to decode 
repetition levels, but that solution doesn't work when whole pages have not 
been fetched based on the offset index (found this when an async reader test of 
row skipping failed). I've put up a kludgey fix in #9406 which at least passes 
all tests, but is likely still too brittle.
   
   I do think we likely need to add a test (if one doesn't already exist), for 
correctly skipping records when pages do _not_ start at a record boundary (as 
is still allowed for V1 pages with no offset index).
   
   As I said earlier, page spanning records are the bane of my existence.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to