Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

via GitHub Tue, 23 Jul 2024 11:57:58 -0700


leoyvens commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064


   One thing I've noticed is that after DataFusion 40 this actually works in my 
use case, likely thanks to the statistics code getting fixed, so good news 
there! It does require additionally setting `execution.collect_statistics = 
true`, which makes sense.
   
   However, for my entirely sorted and non-overlapping dataset it made 
Parquet scanning single-threaded (`ParquetExec` with a single file group), 
which is a big performance regression. So it didn't really help me; maybe what 
I actually want is #10316.
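
To illustrate why a fully sorted, non-overlapping dataset collapses into one file group, here is a rough Python model of statistics-based grouping. It is only a sketch under assumptions, not DataFusion's actual implementation: each file is reduced to hypothetical `(min, max)` statistics on the sort column, and the function name, the greedy first-fit strategy, and the strict `<` comparison are all chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical model: a file is just the (min, max) statistics of its sort column.
FileStats = Tuple[int, int]


def split_groups_by_statistics(files: List[FileStats]) -> List[List[FileStats]]:
    """Greedy first-fit: put each file into the first group whose last file
    ends strictly before this file begins, so every group stays sorted."""
    groups: List[List[FileStats]] = []
    for f in sorted(files):  # order by min, then max
        for g in groups:
            if g[-1][1] < f[0]:  # assumption: strict non-overlap is required
                g.append(f)
                break
        else:
            # No existing group can take the file without breaking the order.
            groups.append([f])
    return groups


# Entirely sorted, non-overlapping files: everything fits in one group,
# so a scan with one partition per group has nothing to parallelize over.
sorted_files = [(0, 9), (10, 19), (20, 29), (30, 39)]
# Overlapping ranges force additional groups, which keeps some parallelism.
overlapping_files = [(0, 15), (10, 25), (20, 35), (30, 45)]

single = split_groups_by_statistics(sorted_files)
multiple = split_groups_by_statistics(overlapping_files)
print(len(single), len(multiple))  # prints "1 2"
```

In this model the better the files are sorted, the fewer groups come out, which matches the single file group I saw.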
   
   The consequence for this issue is that turning this on by default would 
regress performance for users that have `execution.collect_statistics = true`. 
Maybe the flag should be merged with `prefer_existing_sort`, which has the 
semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just 
wait for #10316, so we can both avoid the sort and still have a parallel 
`ParquetExec`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
