kumarUjjawal commented on code in PR #9540:
URL: https://github.com/apache/arrow-rs/pull/9540#discussion_r3451185026
##########
parquet/src/arrow/arrow_reader/statistics.rs:
##########
@@ -1495,6 +1501,45 @@ impl<'a> StatisticsConverter<'a> {
})
}
+ /// Create a new `StatisticsConverter` from a Parquet leaf column index
directly.
+ ///
+ /// Unlike [`Self::try_new`], this constructor bypasses schema resolution
and
+ /// accepts a Parquet column index directly. This is useful for nested
fields
+ /// (e.g., struct fields) where the caller has already resolved the mapping
+ /// from the Arrow field to the Parquet leaf column.
+ ///
+ /// # Arguments
+ ///
Review Comment:
should we make a note of the fact that:
`arrow_field` must describe the same column as `parquet_column_index`. If
they don't match, the statistics come back empty (null) instead of returning an
error.
##########
parquet/tests/arrow_reader/statistics.rs:
##########
@@ -2899,6 +2899,52 @@ mod test {
}
}
Review Comment:
should we extend the test to add:
- a non-Int32 leaf (e.g. Date64 or Decimal) to confirm `physical_type` is
threaded correctly, and
- an out-of-bounds index to cover the new error branch.
- A page-level check (data_page_mins/maxes) would be nice to have
##########
parquet/src/arrow/arrow_reader/statistics.rs:
##########
@@ -1495,6 +1501,45 @@ impl<'a> StatisticsConverter<'a> {
})
}
+ /// Create a new `StatisticsConverter` from a Parquet leaf column index
directly.
+ ///
+ /// Unlike [`Self::try_new`], this constructor bypasses schema resolution
and
+ /// accepts a Parquet column index directly. This is useful for nested
fields
+ /// (e.g., struct fields) where the caller has already resolved the mapping
+ /// from the Arrow field to the Parquet leaf column.
+ ///
+ /// # Arguments
+ ///
+ /// * `parquet_column_index` - The index of the leaf column in the Parquet
schema
+ /// * `arrow_field` - The Arrow field describing the column's data type
+ /// * `parquet_schema` - The Parquet schema descriptor (used to look up
the physical type)
+ ///
+ /// # Errors
+ ///
+ /// * If the `parquet_column_index` is out of bounds
+ pub fn from_column_index(
+ parquet_column_index: usize,
+ arrow_field: &'a Field,
+ parquet_schema: &'a SchemaDescriptor,
+ ) -> Result<Self> {
+ if parquet_column_index >= parquet_schema.columns().len() {
+ return Err(arrow_err!(format!(
+ "Parquet column index {} out of bounds, max {}",
Review Comment:
max reads as "highest allowed index", but it's actually the column count
(count 1 → only index 0 is valid).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]