2010YOUY01 opened a new issue, #18116: URL: https://github.com/apache/datafusion/issues/18116
### Is your feature request related to a problem or challenge?

Previously, by default, `EXPLAIN ANALYZE` showed all available metrics inside an operator, which can get quite verbose for some operators. In `datafusion-cli`:

```
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem STORED AS parquet LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select * from lineitem where l_orderkey = 3000000;
```

The Parquet reader alone reports a large number of low-level details:

```
metrics=[output_rows=19813, elapsed_compute=14ns, batches_split=0, bytes_scanned=2147308, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=18, num_predicate_creation_errors=0, page_index_rows_matched=19813, page_index_rows_pruned=729088, predicate_cache_inner_records=0, predicate_cache_records=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=21.997µs, metadata_load_time=273.83µs, page_index_eval_time=29.915µs, row_pushdown_eval_time=42ns, statistics_eval_time=76.248µs, time_elapsed_opening=4.02146ms, time_elapsed_processing=24.787461ms, time_elapsed_scanning_total=24.17671ms, time_elapsed_scanning_until_data=23.103665ms]
```

I believe only a subset of these metrics is commonly used (for example `output_rows`, `metadata_load_time`, and how many files/row groups/pages were pruned), and it would be better to display only the most common ones by default.

After https://github.com/apache/datafusion/pull/18098, the `EXPLAIN ANALYZE` detail level can be controlled through an option:

```
> set datafusion.explain.analyze_level = summary;
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select * from generate_series(10000) as t1(v1) order by v1 desc;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                         |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | SortExec: expr=[v1@0 DESC], preserve_partitioning=[false], metrics=[output_rows=10001, elapsed_compute=100µs]                                                |
|                   |   ProjectionExec: expr=[value@0 as v1], metrics=[output_rows=10001, elapsed_compute=1.166µs]                                                                 |
|                   |     LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10000, batch_size=8192], metrics=[output_rows=10001, elapsed_compute=43µs] |
|                   |                                                                                                                                                              |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> set datafusion.explain.analyze_level = dev;
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select * from generate_series(10000) as t1(v1) order by v1 desc;
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                   |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | SortExec: expr=[v1@0 DESC], preserve_partitioning=[false], metrics=[output_rows=10001, elapsed_compute=222.043µs, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, batches_split=2] |
|                   |   ProjectionExec: expr=[value@0 as v1], metrics=[output_rows=10001, elapsed_compute=2.584µs]                                                                                           |
|                   |     LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10000, batch_size=8192], metrics=[output_rows=10001, elapsed_compute=162.625µs]                      |
|                   |                                                                                                                                                                                        |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
```

Now only `output_rows` and `elapsed_compute` are included in the `summary` analyze level, and the `dev` level is the default.
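The same option can also be set programmatically when embedding DataFusion. The following is a minimal sketch, assuming the `datafusion.explain.analyze_level` key and the `summary`/`dev` values introduced in #18098 as shown above (the exact key and accepted values may change), and a `tokio` runtime:

```rust
use datafusion::error::Result;
use datafusion::prelude::{SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Assumption: option key and values are as introduced in #18098.
    let config = SessionConfig::new().set_str("datafusion.explain.analyze_level", "summary");
    let ctx = SessionContext::new_with_config(config);

    // EXPLAIN ANALYZE executes the query and annotates the plan with metrics;
    // with `summary`, only the high-level metrics should be printed.
    ctx.sql("EXPLAIN ANALYZE SELECT * FROM generate_series(10000) AS t1(v1) ORDER BY v1 DESC")
        .await?
        .show()
        .await?;

    Ok(())
}
```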
The goal of this issue is, for each operator, to:

- Investigate which metrics are most commonly used for high-level insights; Postgres's and DuckDB's output can be used as a reference
- Set those common metrics to the `summary` type

Finally, make the `summary` analyze level the default.

### Describe the solution you'd like

Identify which metrics should be included in the `summary` level for all operators (a rough sketch of where these metrics are declared today is included at the end of this issue):

- [ ] DataSourceExec
- [ ] DataSinkExec
- [ ] AnalyzeExec
- [ ] AsyncFuncExec
- [ ] AggregateExec
- [ ] CoalesceBatchesExec
- [ ] CoalescePartitionsExec
- [ ] CooperativeExec
- [ ] CrossJoinExec
- [ ] EmptyExec
- [ ] ExplainExec
- [ ] FilterExec
- [ ] GlobalLimitExec
- [ ] LocalLimitExec
- [ ] HashJoinExec
- [ ] NestedLoopJoinExec
- [ ] SortMergeJoinExec
- [ ] SymmetricHashJoinExec
- [ ] PlaceholderRowExec
- [ ] ProjectionExec
- [ ] RecursiveQueryExec
- [ ] RepartitionExec
- [ ] SortExec
- [ ] PartialSortExec
- [ ] SortPreservingMergeExec
- [ ] StreamingTableExec
- [ ] UnionExec
- [ ] InterleaveExec
- [ ] UnnestExec
- [ ] WindowAggExec
- [ ] BoundedWindowAggExec
- [ ] WorkTableExec
- [ ] LazyMemoryExec

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
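As a starting point for the triage checklist above, here is a rough sketch of where a typical operator declares its metrics today, using the existing `ExecutionPlanMetricsSet` / `MetricBuilder` API. It only illustrates which declarations would need a summary-vs-dev decision; the classification mechanism added in #18098 is not shown and may differ, and the function name `register_example_metrics` is made up for illustration:

```rust
use datafusion::physical_plan::metrics::{
    BaselineMetrics, ExecutionPlanMetricsSet, MetricBuilder,
};

/// Illustrative only: the kind of per-partition metrics a typical operator
/// registers. Picking the `summary` subset for an operator means going
/// through declarations like these and deciding which ones carry
/// high-level insight.
fn register_example_metrics(metrics: &ExecutionPlanMetricsSet, partition: usize) {
    // `output_rows` and `elapsed_compute`: already kept at the `summary`
    // level for every operator, per the current behavior described above.
    let baseline = BaselineMetrics::new(metrics, partition);
    let _elapsed_compute = baseline.elapsed_compute();

    // Operator-specific counters such as these are the ones that need a
    // per-operator summary-vs-dev decision.
    let _spill_count = MetricBuilder::new(metrics).spill_count(partition);
    let _spilled_bytes = MetricBuilder::new(metrics).spilled_bytes(partition);
    let _batches_split = MetricBuilder::new(metrics).counter("batches_split", partition);
}

fn main() {
    let metrics = ExecutionPlanMetricsSet::new();
    register_example_metrics(&metrics, 0);
    // Prints the registered metrics, similar to the `metrics=[...]` blocks above.
    println!("{}", metrics.clone_inner());
}
```

For scans, the pruning counters called out earlier (files/row groups/pages pruned) are likely `summary` candidates, while the fine-grained timers probably stay at the `dev` level.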
