Re: [PR] feat(parquet): Implement `scan_efficiency_ratio` metric for parquet reading [datafusion]

via GitHub Tue, 11 Nov 2025 12:46:21 -0800


petern48 commented on code in PR #18577:
URL: https://github.com/apache/datafusion/pull/18577#discussion_r2515746103



##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -129,6 +130,17 @@ impl AsyncFileReader for ParquetFileReader {
     }
 }
 
+impl Drop for ParquetFileReader {
+    fn drop(&mut self) {
+        self.file_metrics
+            .scan_efficiency_ratio
+            .add_part(self.file_metrics.bytes_scanned.value());
+        self.file_metrics
+            .scan_efficiency_ratio
+            .add_total(self.partitioned_file.object_meta.size as usize);

Review Comment:
   Pushed a fix for this. I also manually verified that the results are the 
same.
   
   ```
   tpchgen-cli -s 1 --format=parquet --output-dir ...
   
   set datafusion.execution.target_partitions = 2;
   explain analyze select * from '<path>/lineitem.parquet' where l_orderkey = 
42;
   -- scan_efficiency_ratio=0.83% (1919614/231669547)
   
   set datafusion.execution.target_partitions = 1;
   explain analyze select * from '<path>/lineitem.parquet' where l_orderkey = 
42;
   -- scan_efficiency_ratio=0.83% (1919614/231669547)
   ```
   
   I was hoping to then write a test that would fail on the previous 
implementation but pass, now. I found it difficult to write a test that does so 
with the existing files in `datafusion/core/tests/data` (which are all fairly 
small). Then I tried to writing a test that mimics this exactly, but found the 
file was quite large (200+ MB), compared to the rest of the tests in the 
directory.
   
   <img width="962" height="115" alt="image" 
src="https://github.com/user-attachments/assets/8258382d-a93b-4a42-90f0-df2861e94d1d";
 />
   
   <details>
       <summary>The test I wrote (not pushed)</summary>
   
   ```rust
   #[tokio::test]
   #[cfg_attr(tarpaulin, ignore)]
   async fn parquet_scan_efficiency_ratio() {
       let table_name = "tpch_lineitem_sf1";
       let parquet_path = "tests/data/tpch_lineitem_sf1.parquet";
   
       for num_partitions in [1, 2, 5] {
           let config = 
SessionConfig::new().with_target_partitions(num_partitions);
           let ctx = SessionContext::new_with_config(config);
           ctx.register_parquet(table_name, parquet_path, 
ParquetReadOptions::default())
               .await
               .expect("register parquet table for explain analyze test");
   
           let sql =
               "EXPLAIN ANALYZE select * from tpch_lineitem_sf1 where 
l_orderkey = 42;";
           let actual = execute_to_batches(&ctx, sql).await;
           let formatted = arrow::util::pretty::pretty_format_batches(&actual)
               .unwrap()
               .to_string();
   
           assert_contains!(
               &formatted,
               "scan_efficiency_ratio=0.83% (1919614/231669547)"
           );
       }
   ```
   </details>
   
   How would you like to proceed? Do you still want a new test, or is the PR 
good as it is? The test that existed before this changed passes before and 
after.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(parquet): Implement `scan_efficiency_ratio` metric for parquet reading [datafusion]

Reply via email to