scovich commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2804610423


##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
                 });
 
                 if let Some(rows) = rows {
+                    // If there is a pending partial record from a previous 
page,
+                    // count it before considering the whole-page skip. When 
the
+                    // next page provides num_rows (e.g. a V2 data page or via
+                    // offset index), its records are self-contained, so the
+                    // partial from the previous page is complete at this 
boundary.
+                    if let Some(decoder) = self.rep_level_decoder.as_mut() {
+                        if decoder.flush_partial() {

Review Comment:
   >  If all pages are V2, then has_partial will never be true, because V2 
pages must start on a new record (R=0). If all pages are V1 this will never 
trigger because num_rows will be None. This case only applies when switching 
from V1 to V2
   
   I know _nothing_ about v1 vs. v2 parquet, but the unit test says it only 
uses v2 pages, with the comment
   > parquet-rs writes only v2 data pages and no offset index is loaded, so 
at_record_boundary() returns false for non-last page
   
   So it sounds like there are multiple ways for decoding to end in the middle 
of a page (leaving `has_partial=true`), but the correct fix in all cases is to 
`flush_partial` upon reaching a new v2 page (because the page must start on a 
new record)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to