2010YOUY01 opened a new issue, #18355:
URL: https://github.com/apache/datafusion/issues/18355

   ### Is your feature request related to a problem or challenge?
   
   Follow up to https://github.com/apache/datafusion/pull/18321
   Original discussion: https://github.com/apache/datafusion/pull/18321#issuecomment-3459671363
   
   ### Background
   For each row group, the parquet scanner tries to prune it in the following order:
   1. Check whether the row group can be pruned by statistics (e.g. column `a` has statistics `min=1, max=10` and the query predicate is `a>15`, so the whole row group can be skipped).
   2. Check whether the row group can be pruned by its bloom filter, in a similar way.
   
   Metrics can be used to check the pruning result.
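
The pruning order above can be sketched roughly as follows. This is a minimal illustration, not DataFusion's implementation: `RowGroup`, `prune_row_groups`, and the use of a plain Python set as a stand-in for a bloom filter are all assumptions made for the example (a real bloom filter can return false positives, a set cannot). It uses an equality predicate, since that is the case where a bloom filter can help.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class RowGroup:
    """Hypothetical row-group summary for a single integer column."""
    min_value: int
    max_value: int
    # Stand-in for a bloom filter; None means the file has no bloom filter.
    bloom_filter: Optional[Set[int]] = None


def prune_row_groups(
    row_groups: List[RowGroup], value: int
) -> Tuple[List[RowGroup], List[RowGroup]]:
    """Apply both pruning stages for the predicate `col = value`.

    Returns the row groups that survive the statistics check, and the
    subset of those that also survive the bloom filter check.
    """
    # Stage 1: statistics. Keep a row group only if value is within [min, max].
    after_stats = [
        rg for rg in row_groups if rg.min_value <= value <= rg.max_value
    ]
    # Stage 2: bloom filter. A row group without a bloom filter cannot be
    # pruned here, so it is kept, exactly like one whose filter may contain
    # the value.
    after_bloom = [
        rg
        for rg in after_stats
        if rg.bloom_filter is None or value in rg.bloom_filter
    ]
    return after_stats, after_bloom


if __name__ == "__main__":
    groups = [RowGroup(1, 10), RowGroup(1, 20, {3, 7}), RowGroup(1, 20)]
    stats, bloom = prune_row_groups(groups, 15)
    print(f"statistics: {len(groups)} total -> {len(stats)} matched")
    print(f"bloom filter: {len(stats)} total -> {len(bloom)} matched")
```

Note that in stage 2 a missing bloom filter and a bloom filter that may contain the value both count as "matched", which is where the ambiguity discussed in this issue comes from.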
   
   ### Checking Metrics
   In `datafusion-cli`, run
   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
   STORED AS parquet
   LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
   
   set datafusion.explain.analyze_level = summary;
   
   explain analyze select *
   from lineitem
   where l_orderkey = 3000000;
   ```
   
   You will get the parquet metrics:
   ```
   DataSourceExec: ...metrics=[... row_groups_pruned_statistics=1 total → 1 
matched,row_groups_pruned_bloom_filter=1 total → 1 matched,...]
   ```
   
   `row_groups_pruned_statistics=1 total → 1 matched` means we start with 1 row group; its statistics were checked and it could not be pruned.
   `row_groups_pruned_bloom_filter=1 total → 1 matched` means, in this case, that no bloom filter is available, so the row group can't be skipped either; the 1 matched row group moves on to further checks.
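
To make the reading of these values concrete, here is a small sketch that splits one metric of the form `name=N total → M matched` into its parts. `parse_pruning_metric` is a hypothetical helper written for this example, and it assumes the counts are printed as plain integers exactly as in the output above.

```python
import re


def parse_pruning_metric(text: str) -> dict:
    """Parse a single 'name=N total → M matched' metric string."""
    m = re.fullmatch(r"(\w+)=(\d+) total → (\d+) matched", text.strip())
    if m is None:
        raise ValueError(f"unrecognized metric format: {text!r}")
    total, matched = int(m.group(2)), int(m.group(3))
    return {
        "name": m.group(1),
        "total": total,      # row groups entering this pruning stage
        "matched": matched,  # row groups kept after this stage
        "pruned": total - matched,
    }


if __name__ == "__main__":
    print(parse_pruning_metric("row_groups_pruned_statistics=1 total → 1 matched"))
```

With the output above, both metrics give `pruned` equal to 0: no row group was skipped at either stage.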
   
   Note: the parquet table was generated using the setup in `benchmarks/`, and https://parquet-viewer.xiangpeng.systems/ can be used to check whether bloom filters are available.
   
   ### Issue
   `row_groups_pruned_bloom_filter=1 total → 1 matched` is ambiguous: we can't tell whether the bloom filter was checked and could not prune the row group, or whether no bloom filter was available.
   A better way to display this: if the bloom filter is unavailable, don't display this metric.
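
One possible shape for the proposed display, as a sketch only: `format_pruning_metrics` and its `bloom_filters_checked` parameter are hypothetical names invented for this example and do not correspond to existing DataFusion code. The idea is to track how many row groups actually had a bloom filter consulted and to omit the bloom filter metric when that count is zero.

```python
def format_pruning_metrics(
    total: int,
    matched_statistics: int,
    matched_bloom_filter: int,
    bloom_filters_checked: int,
) -> str:
    """Render row-group pruning metrics, hiding the bloom filter metric
    when no bloom filter was actually consulted."""
    parts = [
        f"row_groups_pruned_statistics={total} total → {matched_statistics} matched"
    ]
    # Only report bloom filter pruning if at least one filter was checked;
    # otherwise "N total → N matched" would be indistinguishable from
    # "checked, but nothing could be pruned".
    if bloom_filters_checked > 0:
        parts.append(
            f"row_groups_pruned_bloom_filter={matched_statistics} total → "
            f"{matched_bloom_filter} matched"
        )
    return ",".join(parts)


if __name__ == "__main__":
    # No bloom filter in the file: only the statistics metric is shown.
    print(format_pruning_metrics(1, 1, 1, bloom_filters_checked=0))
    # Bloom filter present and checked: both metrics are shown.
    print(format_pruning_metrics(1, 1, 1, bloom_filters_checked=1))
```

With this approach, seeing `row_groups_pruned_bloom_filter=1 total → 1 matched` would always mean the filter was checked and could not prune the row group.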
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

