kosiew commented on code in PR #22857:
URL: https://github.com/apache/datafusion/pull/22857#discussion_r3425258959


##########
datafusion/datasource-parquet/src/metrics.rs:
##########
@@ -71,6 +71,9 @@ pub struct ParquetFileMetrics {
     pub page_index_pages_pruned: PruningMetrics,
     /// Total time spent evaluating parquet page index filters
     pub page_index_eval_time: Time,
+    /// Number of times page index I/O was skipped because row-group statistics
+    /// already proved page index could not prune further
+    pub page_index_load_skipped: Count,

Review Comment:
   Thanks for adding the regression coverage. One thing that concerns me here 
is that `ParquetFileMetrics` is a public struct with public fields, so adding 
`pub page_index_load_skipped: Count` becomes a semver-breaking API change. 
Downstream users can construct this struct with literals or match on its 
fields, and CI is already flagging this.
   
   Could we avoid adding a new public field just for the test assertion?
   
   A non-breaking alternative would be to register the metric through a 
crate-private helper, similar to 
`add_page_index_pages_skipped_by_fully_matched(...)`, for example:
   
   ```rust
   pub(crate) fn add_page_index_load_skipped(
       metrics: &ExecutionPlanMetricsSet,
       partition: usize,
       filename: &str,
       n: usize,
   ) {
       if n == 0 {
           return;
       }
   
       let count = MetricBuilder::new(metrics)
           .with_new_label("filename", filename.to_string())
           .with_type(MetricType::Summary)
           .counter("page_index_load_skipped", partition);
       count.add(n);
   }
   ```
   
   Then the skip branch in `opener/mod.rs` can record the metric using the 
existing `PreparedParquetOpen` context (`metrics`, `partition_index`, and 
`file_name`), and the test can continue asserting it via 
`ExecutionPlanMetricsSet::sum_by_name("page_index_load_skipped")`.
   
   That keeps the regression coverage while avoiding a public API change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to