petern48 commented on code in PR #18577:
URL: https://github.com/apache/datafusion/pull/18577#discussion_r2515746103
##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -129,6 +130,17 @@ impl AsyncFileReader for ParquetFileReader {
}
}
+impl Drop for ParquetFileReader {
+ fn drop(&mut self) {
+ self.file_metrics
+ .scan_efficiency_ratio
+ .add_part(self.file_metrics.bytes_scanned.value());
+ self.file_metrics
+ .scan_efficiency_ratio
+ .add_total(self.partitioned_file.object_meta.size as usize);
Review Comment:
Pushed a fix for this. I also manually verified that the results are the
same.
```
tpchgen-cli -s 1 --format=parquet --output-dir ...
set datafusion.execution.target_partitions = 2;
explain analyze select * from '<path>/lineitem.parquet' where l_orderkey =
42;
-- scan_efficiency_ratio=0.83% (1919614/231669547)
set datafusion.execution.target_partitions = 1;
explain analyze select * from '<path>/lineitem.parquet' where l_orderkey =
42;
-- scan_efficiency_ratio=0.83% (1919614/231669547)
```
I was hoping to then write a test that would fail on the previous
implementation but pass, now. I found it difficult to write a test that does so
with the existing files in `datafusion/core/tests/data` (which are all fairly
small). Then I tried to writing a test that mimics this exactly, but found the
file was quite large (200+ MB), compared to the rest of the tests in the
directory.
<img width="962" height="115" alt="image"
src="https://github.com/user-attachments/assets/8258382d-a93b-4a42-90f0-df2861e94d1d"
/>
<details>
<summary>The test I wrote (not pushed)</summary>
```rust
#[tokio::test]
#[cfg_attr(tarpaulin, ignore)]
async fn parquet_scan_efficiency_ratio() {
let table_name = "tpch_lineitem_sf1";
let parquet_path = "tests/data/tpch_lineitem_sf1.parquet";
for num_partitions in [1, 2, 5] {
let config =
SessionConfig::new().with_target_partitions(num_partitions);
let ctx = SessionContext::new_with_config(config);
ctx.register_parquet(table_name, parquet_path,
ParquetReadOptions::default())
.await
.expect("register parquet table for explain analyze test");
let sql =
"EXPLAIN ANALYZE select * from tpch_lineitem_sf1 where
l_orderkey = 42;";
let actual = execute_to_batches(&ctx, sql).await;
let formatted = arrow::util::pretty::pretty_format_batches(&actual)
.unwrap()
.to_string();
assert_contains!(
&formatted,
"scan_efficiency_ratio=0.83% (1919614/231669547)"
);
}
```
</details>
How would you like to proceed? Do you still want a new test, or is the PR
good as it is? The test that existed before this changed passes before and
after.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]