2010YOUY01 opened a new issue, #18355: URL: https://github.com/apache/datafusion/issues/18355
### Is your feature request related to a problem or challenge?

Follow-up to https://github.com/apache/datafusion/pull/18321

Original discussion: https://github.com/apache/datafusion/pull/18321#issuecomment-3459671363

### Background

For each row group, the parquet scanner tries to prune it in the following order:
1. Check whether the row group can be pruned by statistics (e.g. column `a` has statistics `min=1, max=10` and the predicate in the query asks for rows where `a > 15`, so the whole row group can be skipped).
2. Check whether the row group can be pruned using the bloom filter, in the same way.

Metrics can be used to check the pruning result.

### Checking Metrics

In `datafusion-cli`, run
```
CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';

set datafusion.explain.analyze_level = summary;

explain analyze select * from lineitem where l_orderkey = 3000000;
```

This prints the parquet metrics:
```
DataSourceExec: ...metrics=[... row_groups_pruned_statistics=1 total → 1 matched,row_groups_pruned_bloom_filter=1 total → 1 matched,...]
```

`row_groups_pruned_statistics=1 total → 1 matched` means we start with 1 row group, its statistics were checked, and it can't be pruned.

`row_groups_pruned_bloom_filter=1 total → 1 matched` means no bloom filter is available, so the row group can't be skipped here either. The 1 matched row group continues to further checks.

Note: the parquet table is generated using the setup in `benchmarks/`, and https://parquet-viewer.xiangpeng.systems/ can be used to check whether bloom filters are available.

### Issue

`row_groups_pruned_bloom_filter=1 total → 1 matched` is ambiguous: we can't tell whether the bloom filter was checked and found that the row group can't be pruned, or whether the bloom filter is simply not available.

A better way to display this: if the bloom filter is unavailable, don't display this metric.

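
To make the proposal concrete, here is a minimal Python sketch of the two-step pruning order and of a display that omits the bloom filter metric when no filter was available. It is not DataFusion code: `RowGroup`, `PruningMetric`, `prune_for_equality` and `format_metrics` are hypothetical names, the bloom filter is modelled as an exact set (a real one can return false positives), and, unlike the current metric, the sketch only counts row groups that actually had a filter.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for the pruning inputs of one row group."""
    min_value: int
    max_value: int
    # None models "no bloom filter was written for this column".
    bloom_values: Optional[Set[int]] = None


@dataclass
class PruningMetric:
    """Counter rendered in the 'N total → M matched' shape shown above."""
    total: int = 0
    matched: int = 0

    def __str__(self) -> str:
        return f"{self.total} total → {self.matched} matched"


def prune_for_equality(
    row_groups: List[RowGroup], target: int
) -> Tuple[List[RowGroup], PruningMetric, PruningMetric]:
    """Apply statistics pruning, then bloom filter pruning, for `col = target`."""
    stats = PruningMetric()
    bloom = PruningMetric()
    survivors = []
    for rg in row_groups:
        # Step 1: min/max statistics. A target outside the range cannot match.
        stats.total += 1
        if not (rg.min_value <= target <= rg.max_value):
            continue
        stats.matched += 1

        # Step 2: bloom filter, only if one exists for this row group.
        if rg.bloom_values is None:
            survivors.append(rg)  # nothing to check, keep the row group
            continue
        bloom.total += 1
        if target in rg.bloom_values:
            bloom.matched += 1
            survivors.append(rg)
    return survivors, stats, bloom


def format_metrics(stats: PruningMetric, bloom: PruningMetric) -> str:
    """Render metrics, skipping the bloom filter one if it never ran."""
    parts = [f"row_groups_pruned_statistics={stats}"]
    if bloom.total > 0:  # assumption: zero checks means "unavailable"
        parts.append(f"row_groups_pruned_bloom_filter={bloom}")
    return ",".join(parts)
```

With one row group that has no bloom filter, `format_metrics` returns only the statistics metric, so a displayed `row_groups_pruned_bloom_filter` would always mean the filter was actually consulted.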
### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
