etseidl commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805762293


##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
                 });
 
                 if let Some(rows) = rows {
+                    // If there is a pending partial record from a previous 
page,
+                    // count it before considering the whole-page skip. When 
the
+                    // next page provides num_rows (e.g. a V2 data page or via
+                    // offset index), its records are self-contained, so the
+                    // partial from the previous page is complete at this 
boundary.
+                    if let Some(decoder) = self.rep_level_decoder.as_mut() {
+                        if decoder.flush_partial() {

Review Comment:
   > I haven't been following the conversation very closely, but IIRC records 
shouldn't be split across any pages, regardless of V1 or V2. The issue is that 
many older writers, including arrow-rs, did occasionally do this, as the spec 
at the time was ambiguous. IIRC this was independent of V1 vs V2.
   
   Thanks for chiming in @tustvold. The last time this was kicked around in the 
Parquet community (see https://github.com/apache/parquet-format/pull/244) the 
consensus seemed to me to be that when writing files with V1 headers and no 
offset index, it was still ok to have rows span multiple pages. In fact, 
pyarrow as of 20.0.0 will still produce such files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to