Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005902466 > @alamb do you think we can close this issue and continue in https://github.com/apache/datafusion/issues/16555? Good idea -- thank you for the follow up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb closed issue #16402: Add statistics to ParquetExec for *files* pruned URL: https://github.com/apache/datafusion/issues/16402
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005250264

I did a bit of investigation. Using `hits_partitioned` and some massaging I was able to get the expected result:

```sql
SET datafusion.execution.target_partitions = 1;
EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;

> EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
+---+---+
| plan_type | plan |
+---+---+
| Plan with Metrics | SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC], preserve_partitioning=[false], filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0 > EF83BC09D0B2D0BBD0B0...], metrics=[output_rows=10, elapsed_compute=1.284534752s, row_replacements=176] |
| | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1197862, elapsed_compute=43.685164ms]
```
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2993688905 I will look into this next week
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971519773

> Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`) -> the pruning is happening at the row group level?

```sql
set datafusion.execution.collect_statistics = true;
EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
```

And that still doesn't seem to help.

Let's wait for https://github.com/apache/datafusion/pull/15770 to merge and then I can dig into this more.
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971520273 (basically I want to be able to see from statistics when the dynamic filters are helping / not helping)
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971481867 Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`) -> the pruning is happening at the row group level?
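
The hypothesis above can be illustrated with a toy model. This is only a sketch under stated assumptions, not DataFusion's implementation: `ParquetFile`, `RowGroup`, `has_file_stats`, and `prune` are hypothetical names, and the predicate is simplified to `col > threshold`, the shape a TopK dynamic filter might take for a descending sort. When file-level statistics are missing, the same data is still skipped, but the work is counted against row groups rather than files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    max_value: str  # max of the filtered column within this row group


@dataclass
class ParquetFile:
    row_groups: List[RowGroup]
    has_file_stats: bool  # hypothetical flag: were file-level min/max collected?


def can_skip(max_value: str, threshold: str) -> bool:
    # `col > threshold` cannot match any row if the max is <= threshold.
    return max_value <= threshold


def prune(files: List[ParquetFile], threshold: str) -> Tuple[int, int]:
    """Return (files_pruned, row_groups_pruned) for `col > threshold`."""
    files_pruned = 0
    row_groups_pruned = 0
    for f in files:
        # In this toy model the file-level max is derived from its row groups.
        file_max = max(rg.max_value for rg in f.row_groups)
        if f.has_file_stats and can_skip(file_max, threshold):
            files_pruned += 1  # whole file skipped; row groups never inspected
            continue
        # No usable file-level stats: fall back to per-row-group pruning.
        row_groups_pruned += sum(
            can_skip(rg.max_value, threshold) for rg in f.row_groups
        )
    return files_pruned, row_groups_pruned
```

In this model a file-level "files pruned" counter stays at zero whenever file statistics are unavailable, even though the filter is still eliminating data at the row group level.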
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971469382

> Hmm maybe we aren't including that statistic in the output?

I think everything that is non-zero is included. I'll have to look into it some more.
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971392142

> > cover that?
>
> Yes 🤦
>
> For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/user-attachments/files/20731770/q25-analyze-topk-dynamic-filter.txt)
>
> Which I made with this command from the [#15770](https://github.com/apache/datafusion/pull/15770) branch
>
>     $ cat q25-analyze.sql
>     EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
>     ./datafusion-cli-topk-dynamic-filters -f q25-analyze.sql > q25-analyze-topk-dynamic-filter.txt
>
> But the query clearly got faster, so I would expect it to be present 🤔

Hmm maybe we aren't including that statistic in the output?
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971390765

> cover that?

Yes 🤦

For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/user-attachments/files/20731770/q25-analyze-topk-dynamic-filter.txt)

Which I made with this command from the https://github.com/apache/datafusion/pull/15770 branch

```shell
$ cat q25-analyze.sql
EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
./datafusion-cli-topk-dynamic-filters -f q25-analyze.sql > q25-analyze-topk-dynamic-filter.txt
```

But the query clearly got faster, so I would expect it to be present 🤔
Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971321730 Doesn't https://github.com/apache/datafusion/blob/4dd6923787084548c9ecc6d90c630c2c28ee9259/datafusion/datasource-parquet/src/metrics.rs#L30-L33 cover that?
