2010YOUY01 commented on code in PR #18577:
URL: https://github.com/apache/datafusion/pull/18577#discussion_r2509906685
##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -129,6 +130,17 @@ impl AsyncFileReader for ParquetFileReader {
}
}
+impl Drop for ParquetFileReader {
+ fn drop(&mut self) {
+ self.file_metrics
+ .scan_efficiency_ratio
+ .add_part(self.file_metrics.bytes_scanned.value());
+ self.file_metrics
+ .scan_efficiency_ratio
+ .add_total(self.partitioned_file.object_meta.size as usize);
Review Comment:
I vaguely remember that each Parquet reader may read only a sub-range of a file; if that's the case here, we might double-count the total size.
I verified with the query below. `l_orderkey` is globally ordered, so with the predicate `l_orderkey = 42` only the first row group is scanned. With two target partitions there are two `ParquetFileReader` instances, each reading half of the file range, so the resulting `scan_efficiency_ratio` should be the same under either partitioning; currently the two settings report different ratios. A rough sketch of one possible adjustment follows the query output.
Data was generated with
https://github.com/clflushopt/tpchgen-rs/tree/main/tpchgen-cli (Parquet, SF1).
```
-- If target_partitions=1, the ratio is 0.86%
-- If target_partitions=2, the ratio is 0.43%
> set datafusion.execution.target_partitions = 2;
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select * from
'/Users/yongting/data/tpch_sf1/lineitem.parquet' where l_orderkey = 42;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---+
| plan_type | plan
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192,
metrics=[output_rows=0, elapsed_compute=376ns, output_bytes=0.0 B]
|
| | FilterExec: l_orderkey@0 = 42,
metrics=[output_rows=0, elapsed_compute=25.835µs, output_bytes=0.0 B,
selectivity=0% (0/20096)]
|
| | DataSourceExec: file_groups={2 groups:
[[Users/yongting/data/tpch_sf1/lineitem.parquet:0..115375407],
[Users/yongting/data/tpch_sf1/lineitem.parquet:115375407..230750813]]},
projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate,
l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment],
file_type=parquet, predicate=l_orderkey@0 = 42,
pruning_predicate=l_orderkey_null_count@2 != row_count@3 AND l_orderkey_min@0
<= 42 AND 42 <= l_orderkey_max@1, required_guarantees=[l_orderkey in (42)],
metrics=[output_rows=20096, elapsed_compute=2ns, output_bytes=6.6 MB,
files_ranges_pruned_statistics=2 total → 2 matched,
row_groups_pruned_statistics=49 total → 1 matched,
row_groups_pruned_bloom_filter=1 total → 1 matched,
page_index_rows_pruned=123010 total → 20096 matched, bytes_scanned=1994659,
metadata_load_time=123.46µs, scan_efficiency_ratio=0.43% (1994659/
461501626)] |
| |
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---+
1 row(s) fetched.
Elapsed 0.010 seconds.
```
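One possible way to make the denominator consistent across partitionings (a rough sketch only, not verified against this PR; it assumes `PartitionedFile::range` carries the `FileRange` assigned to this reader) would be to add only the assigned byte range to the total, falling back to the full file size when the file is not split:
```rust
// Sketch of a Drop impl that avoids double counting when one file is
// split across several partitions. Assumes `PartitionedFile::range` is an
// `Option<FileRange>` whose `start`/`end` describe this reader's byte range.
impl Drop for ParquetFileReader {
    fn drop(&mut self) {
        self.file_metrics
            .scan_efficiency_ratio
            .add_part(self.file_metrics.bytes_scanned.value());

        // Count only the byte range this reader was assigned; fall back to
        // the whole file when the file was not split.
        let total = match &self.partitioned_file.range {
            Some(range) => (range.end - range.start) as usize,
            None => self.partitioned_file.object_meta.size as usize,
        };
        self.file_metrics.scan_efficiency_ratio.add_total(total);
    }
}
```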
##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -97,6 +97,7 @@ impl DefaultParquetFileReaderFactory {
pub struct ParquetFileReader {
pub file_metrics: ParquetFileMetrics,
pub inner: ParquetObjectReader,
+ pub partitioned_file: PartitionedFile,
Review Comment:
I think so, but a breaking change is acceptable here.
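As a side note on the breaking-change concern: adding a field to a struct whose fields are all public breaks downstream code that builds it with a struct literal. A minimal standalone illustration (the stand-in types are hypothetical, not the real DataFusion ones):
```rust
// Hypothetical stand-in types, only to show the compile-time effect.
pub struct ParquetFileMetrics;
pub struct ParquetObjectReader;
pub struct PartitionedFile;

pub struct ParquetFileReader {
    pub file_metrics: ParquetFileMetrics,
    pub inner: ParquetObjectReader,
    // Newly added public field, as in the diff above.
    pub partitioned_file: PartitionedFile,
}

fn main() {
    // Downstream code that previously wrote
    //     ParquetFileReader { file_metrics, inner }
    // no longer compiles; every construction site must now also supply
    // `partitioned_file`.
    let _reader = ParquetFileReader {
        file_metrics: ParquetFileMetrics,
        inner: ParquetObjectReader,
        partitioned_file: PartitionedFile,
    };
}
```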