etseidl commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805223807
##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
});
if let Some(rows) = rows {
+ // If there is a pending partial record from a previous
page,
+ // count it before considering the whole-page skip. When
the
+ // next page provides num_rows (e.g. a V2 data page or via
+ // offset index), its records are self-contained, so the
+ // partial from the previous page is complete at this
boundary.
+ if let Some(decoder) = self.rep_level_decoder.as_mut() {
+ if decoder.flush_partial() {
Review Comment:
> I know _nothing_ about v1 vs. v2 parquet, but the unit test says it only
uses v2 pages, with the comment
The original comment was for an earlier revision of the test which mixed V1
and V2 pages.
> So it sounds like there are multiple ways for decoding to end in the
middle of a page (leaving has_partial=true), but the correct fix in all cases
is to flush_partial upon reaching a new v2 page (because the page must start on
a new record)?
The issue is decoding hitting the _end_ of a page. There's no way of knowing
if the page ends at a record boundary (unless the page is the last one in a
column chunk), so the `has_partial` flag is set and the number of records read
is off by one. We should be able to reason that if V2 headers or the offset
index are present we _are_ at a record boundary, but unfortunately earlier
writers (including the one in this crate) did not universally follow the spec.
As a consequence, `at_record_boundary`, which is used to set
`has_record_delimiter`, is only `true` at a column chunk boundary (i.e. it was
decided to not use V2/offset index as a guarantor of page boundaries being
record boundaries).
The current fix, however, _does_ make that assumption (as does the proposed
alternate fix). I had proposed simply not allowing the whole-page-skip
optimization in the presence of repetition, and instead continuing to decode
repetition levels, but that solution doesn't work when whole pages have not
been fetched based on the offset index (found this when an async reader test of
row skipping failed). I've put up a kludgey fix in #9406 which at least passes
all tests, but is likely still too brittle.
I do think we likely need to add a test (if one doesn't already exist), for
correctly skipping records when pages do _not_ start at a record boundary (as
is still allowed for V1 pages with no offset index).
As I said earlier, page spanning records are the bane of my existence.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]