etseidl commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805671180
##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
});
if let Some(rows) = rows {
+ // If there is a pending partial record from a previous page,
+ // count it before considering the whole-page skip. When the
+ // next page provides num_rows (e.g. a V2 data page or via
+ // offset index), its records are self-contained, so the
+ // partial from the previous page is complete at this boundary.
+ if let Some(decoder) = self.rep_level_decoder.as_mut() {
+ if decoder.flush_partial() {
Review Comment:
> Datapoint: The file that led to the original error message was written
with arrow-rs version 57.1.0: [parquet
viewer](https://private-user-images.githubusercontent.com/30271979/546253888-ec67ea13-1ead-4430-af64-041773c38ecc.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzEwMDc2NzEsIm5iZiI6MTc3MTAwNzM3MSwicGF0aCI6Ii8zMDI3MTk3OS81NDYyNTM4ODgtZWM2N2VhMTMtMWVhZC00NDMwLWFmNjQtMDQxNzczYzM4ZWNjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwMjEzVDE4MjkzMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZjZGRkY2MzM2Y1MTA2ZjdhNzE3ZmEyN2Q2ZmQ3NWM1MmE1ODZlYTg4N2RiYWE1YzRjODE3M2Y4ZjM2MzVlODgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.5Iz9mldg3Z7bQCkvSEaEe7HYRjmdBYU7xTNnXI_tfM0)
and https://github.com/apache/arrow-rs/issues/9370#issuecomment-3889847488. In
any case, at least at my company we probably have a few PiB of data written with this or an
even earlier version.
To be fair, neither the test data nor, almost certainly, your data has rows
that span pages. Rather, it's the assumption that the data _may_ contain such
rows that leads to the incorrect skip behavior.
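
To make the failure mode concrete, here is a minimal sketch of how I understand it. It is not the actual reader code: `RecordCounter`, `count_complete`, and `skip_across_pages` are hypothetical names, and `flush_partial` here only mimics what the method in this diff appears to do. The assumption is that the last record in a page is held back as "partial" because it might continue on the next page, so it has to be counted explicitly before a whole-page skip that relies on `num_rows`.

```rust
/// Hypothetical stand-in for the repetition-level decoder (not the arrow-rs type).
#[derive(Default)]
struct RecordCounter {
    /// True when the last record seen might continue on the next page.
    has_partial: bool,
}

impl RecordCounter {
    /// Counts the records known to be complete within one page.
    /// A record is only known complete once the next rep level 0 appears.
    fn count_complete(&mut self, rep_levels: &[i16]) -> usize {
        let mut complete = 0;
        for &level in rep_levels {
            if level == 0 {
                if self.has_partial {
                    complete += 1;
                }
                // A new record starts here and stays pending for now.
                self.has_partial = true;
            }
        }
        complete
    }

    /// Marks the pending record as complete; returns true if there was one.
    fn flush_partial(&mut self) -> bool {
        std::mem::take(&mut self.has_partial)
    }
}

/// Skips one decoded page, then a whole page whose row count is known
/// up front (assumed to start on a record boundary).
fn skip_across_pages(first_page_levels: &[i16], next_page_num_rows: usize, flush: bool) -> usize {
    let mut counter = RecordCounter::default();
    let mut skipped = counter.count_complete(first_page_levels);
    if flush && counter.flush_partial() {
        // The pending record cannot continue into a page that reports num_rows.
        skipped += 1;
    }
    skipped + next_page_num_rows
}
```

With first-page levels `[0, 1, 0, 1]` (two records) followed by a page reporting two rows, the sketch counts 3 skipped records without the flush and 4 with it. That is the off-by-one I would expect from the pending-partial assumption, even though no row in the data actually spans a page.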