scovich commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2804610423
##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
});
if let Some(rows) = rows {
+ // If there is a pending partial record from a previous
page,
+ // count it before considering the whole-page skip. When
the
+ // next page provides num_rows (e.g. a V2 data page or via
+ // offset index), its records are self-contained, so the
+ // partial from the previous page is complete at this
boundary.
+ if let Some(decoder) = self.rep_level_decoder.as_mut() {
+ if decoder.flush_partial() {
Review Comment:
> If all pages are V2, then has_partial will never be true, because V2
pages must start on a new record (R=0). If all pages are V1 this will never
trigger because num_rows will be None. This case only applies when switching
from V1 to V2
I know _nothing_ about v1 vs. v2 parquet, but the unit test says it only
uses v2 pages, with the comment
> parquet-rs writes only v2 data pages and no offset index is loaded, so
at_record_boundary() returns false for non-last page
So it sounds like there are multiple ways for decoding to end in the middle
of a page (leaving `has_partial=true`), but the correct fix in all cases is to
`flush_partial` upon reaching a new v2 page (because the page must start on a
new record)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]