etseidl commented on code in PR #9374:
URL: https://github.com/apache/arrow-rs/pull/9374#discussion_r2805762293
##########
parquet/src/column/reader.rs:
##########
@@ -309,6 +309,20 @@ where
});
if let Some(rows) = rows {
+ // If there is a pending partial record from a previous
page,
+ // count it before considering the whole-page skip. When
the
+ // next page provides num_rows (e.g. a V2 data page or via
+ // offset index), its records are self-contained, so the
+ // partial from the previous page is complete at this
boundary.
+ if let Some(decoder) = self.rep_level_decoder.as_mut() {
+ if decoder.flush_partial() {
Review Comment:
> I haven't been following the conversation very closely, but IIRC records
shouldn't be split across any pages, regardless of V1 or V2. The issue is that
many older writers, including arrow-rs, did occasionally do this, as the spec
at the time was ambiguous. IIRC this was independent of V1 vs V2.
Thanks for chiming in @tustvold. The last time this was kicked around in the
Parquet community (see https://github.com/apache/parquet-format/pull/244) the
consensus seemed to me to be that when writing files with V1 headers and no
offset index, it was still ok to have rows span multiple pages. In fact,
pyarrow as of 20.0.0 will still produce such files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]