2010YOUY01 commented on code in PR #18577:
URL: https://github.com/apache/datafusion/pull/18577#discussion_r2509906685


##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -129,6 +130,17 @@ impl AsyncFileReader for ParquetFileReader {
     }
 }
 
+impl Drop for ParquetFileReader {
+    fn drop(&mut self) {
+        self.file_metrics
+            .scan_efficiency_ratio
+            .add_part(self.file_metrics.bytes_scanned.value());
+        self.file_metrics
+            .scan_efficiency_ratio
+            .add_total(self.partitioned_file.object_meta.size as usize);

Review Comment:
   I vaguely remember each parquet reader is possible to read only a sub-range 
of a file, if that's the case here we might double count the total size.
   
   I verified with the following queries -- The `l_orderkey` is globally 
ordered, so if we set predicate `l_orderkey=42` then only the first row group 
will be scanned. if we set partition to 2, then there are two parquet reader 
instance, each reading half of the file range, and the resulting 
`scan_efficiency_ratio` should be same even if they're executed under different 
partitioning, now they're different.
   
   Data is generated with 
https://github.com/clflushopt/tpchgen-rs/tree/main/tpchgen-cli, in parquet, sf1
   ```
   -- If target_partitions=1, the ratio is 0.86%
   -- If target_partitions=2, the ratio is 0.43%
   > set datafusion.execution.target_partitions = 2;
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > explain analyze select * from 
'/Users/yongting/data/tpch_sf1/lineitem.parquet' where l_orderkey = 42;
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ---+
   | plan_type         | plan                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
    |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ---+
   | Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, 
metrics=[output_rows=0, elapsed_compute=376ns, output_bytes=0.0 B]              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                
    |
   |                   |   FilterExec: l_orderkey@0 = 42, 
metrics=[output_rows=0, elapsed_compute=25.835µs, output_bytes=0.0 B, 
selectivity=0% (0/20096)]                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                     
     |
   |                   |     DataSourceExec: file_groups={2 groups: 
[[Users/yongting/data/tpch_sf1/lineitem.parquet:0..115375407], 
[Users/yongting/data/tpch_sf1/lineitem.parquet:115375407..230750813]]}, 
projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, 
l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, 
l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment], 
file_type=parquet, predicate=l_orderkey@0 = 42, 
pruning_predicate=l_orderkey_null_count@2 != row_count@3 AND l_orderkey_min@0 
<= 42 AND 42 <= l_orderkey_max@1, required_guarantees=[l_orderkey in (42)], 
metrics=[output_rows=20096, elapsed_compute=2ns, output_bytes=6.6 MB, 
files_ranges_pruned_statistics=2 total → 2 matched, 
row_groups_pruned_statistics=49 total → 1 matched, 
row_groups_pruned_bloom_filter=1 total → 1 matched, 
page_index_rows_pruned=123010 total → 20096 matched, bytes_scanned=1994659, 
metadata_load_time=123.46µs, scan_efficiency_ratio=0.43% (1994659/
 461501626)] |
   |                   |                                                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
    |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ---+
   1 row(s) fetched.
   Elapsed 0.010 seconds.
   ```



##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -97,6 +97,7 @@ impl DefaultParquetFileReaderFactory {
 pub struct ParquetFileReader {
     pub file_metrics: ParquetFileMetrics,
     pub inner: ParquetObjectReader,
+    pub partitioned_file: PartitionedFile,

Review Comment:
   I think so, but here I think a breaking change is acceptable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to