Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-25 Thread via GitHub


alamb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005902466

   > @alamb do you think we can close this issue and continue in https://github.com/apache/datafusion/issues/16555?
   
   Good idea -- thank you for the follow up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-25 Thread via GitHub


alamb closed issue #16402: Add statistics to ParquetExec for *files* pruned
URL: https://github.com/apache/datafusion/issues/16402





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-25 Thread via GitHub


adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005250264

   I did a bit of investigation.
   
   Using `hits_partitioned` and some massaging I was able to get the expected result:
   
   ```sql
   SET datafusion.execution.target_partitions = 1;
   EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
   
   > EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
   +-------------------+------+
   | plan_type         | plan |
   +-------------------+------+
   | Plan with Metrics | SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC], preserve_partitioning=[false], filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0 > EF83BC09D0B2D0BBD0B0...], metrics=[output_rows=10, elapsed_compute=1.284534752s, row_replacements=176] |
   |                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1197862, elapsed_compute=43.685164ms] |
   ```
   
   [The remainder of the plan is cut off in the archived message.]

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-21 Thread via GitHub


adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2993688905

   I will look into this next week





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


alamb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971519773

   > Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`), so the pruning is happening at the row group level?
   
   ```shell
   set datafusion.execution.collect_statistics = true;
   EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   ```
   
   And that still doesn't seem to help.
   
   Let's wait for https://github.com/apache/datafusion/pull/15770 to merge, and then I can dig into this more.





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


alamb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971520273

   (Basically, I want to be able to see from the statistics when the dynamic filters are helping and when they are not.)
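
   One way to check this from a saved `EXPLAIN ANALYZE` output is to pull the `metrics=[...]` list out of a plan line and look up a counter by name. The following is a minimal sketch, not part of DataFusion: `parse_metrics` and `counter` are hypothetical helper names, the sample line is the `SortExec` line quoted elsewhere in this thread, and `files_ranges_pruned_statistics` is assumed to be the name of the counter defined at the `metrics.rs` lines linked in the thread. Treating a missing metric as zero relies on the assumption, also stated in this thread, that only non-zero metrics are printed.

   ```python
   import re

   # Matches the "metrics=[...]" list on one EXPLAIN ANALYZE plan line.
   _METRICS_RE = re.compile(r"metrics=\[([^\]]*)\]")


   def parse_metrics(plan_line: str) -> dict[str, str]:
       """Return the name -> raw value pairs from a line's metrics=[...] list."""
       match = _METRICS_RE.search(plan_line)
       if match is None:
           return {}
       metrics = {}
       for item in match.group(1).split(","):
           name, sep, value = item.strip().partition("=")
           if sep:  # skip anything that is not a name=value pair
               metrics[name] = value
       return metrics


   def counter(plan_line: str, name: str) -> int:
       """Read an integer counter, treating a missing metric as zero.

       Assumption: metrics that are zero are omitted from the output.
       """
       return int(parse_metrics(plan_line).get(name, "0"))


   # Line copied from the plan quoted in this thread (filter value elided there).
   line = (
       "SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC], "
       "preserve_partitioning=[false], "
       "filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0 > EF83BC09D0B2D0BBD0B0...], "
       "metrics=[output_rows=10, elapsed_compute=1.284534752s, row_replacements=176]"
   )

   print(parse_metrics(line))
   print(counter(line, "row_replacements"))
   # Assumed metric name; it is absent from this line, so it reads as zero.
   print(counter(line, "files_ranges_pruned_statistics"))
   ```

   Running the same lookup on the scan node's line, with and without the dynamic filter branch, would show whether the file-level pruning counter ever becomes non-zero.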





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971481867

   Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`), so the pruning is happening at the row group level?





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


alamb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971469382

   > Hmm maybe we aren't including that statistic in the output?
   
   I think everything that is non-zero is included. I'll have to look into it some more.





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971392142

   > > cover that?
   > 
   > Yes 🤦
   > 
   > For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/user-attachments/files/20731770/q25-analyze-topk-dynamic-filter.txt)
   > 
   > Which I made with this command from the [#15770](https://github.com/apache/datafusion/pull/15770) branch
   > 
   > $ cat q25-analyze.sql
   > EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   > ./datafusion-cli-topk-dynamic-filters -f q25-analyze.sql  > q25-analyze-topk-dynamic-filter.txt
   > 
   > But the query clearly got faster, so I would expect it to be present 🤔
   
   Hmm, maybe we aren't including that statistic in the output?





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


alamb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971390765

   > cover that?
   
   Yes 🤦 
   
   For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/user-attachments/files/20731770/q25-analyze-topk-dynamic-filter.txt)
   
   Which I made with this command from the https://github.com/apache/datafusion/pull/15770 branch:
   ```shell
   $ cat q25-analyze.sql
   EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   $ ./datafusion-cli-topk-dynamic-filters -f q25-analyze.sql > q25-analyze-topk-dynamic-filter.txt
   ```
   
   But the query clearly got faster, so I would expect it to be present 🤔 





Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub


adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971321730

   Doesn't https://github.com/apache/datafusion/blob/4dd6923787084548c9ecc6d90c630c2c28ee9259/datafusion/datasource-parquet/src/metrics.rs#L30-L33 cover that?

