leoyvens commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064
One thing I've noticed is that since DataFusion 40 this actually works for my use case, likely thanks to the statistics code getting fixed, so good news there! It does additionally require setting `execution.collect_statistics = true`, which makes sense.

However, for my entirely sorted and non-overlapping dataset it made Parquet scanning single-threaded (`ParquetExec` with a single file group), which is a big performance regression. So it didn't really help me; maybe what I actually want is #10316.

The consequence for this issue is that turning this on by default would regress performance for users who have `execution.collect_statistics = true`. Maybe the flag should be merged with `prefer_existing_sort`, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so that we can both avoid the sort and still have a parallel `ParquetExec`.
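
For reference, a minimal sketch of the two settings mentioned in this comment, written as SQL `SET` statements for a session such as `datafusion-cli`. The fully qualified key names (`datafusion.execution.…`, `datafusion.optimizer.…`) are an assumption based on DataFusion's usual configuration namespaces, and the table `my_table` and column `ts` are hypothetical placeholders:

```sql
-- Assumed full key for the `execution.collect_statistics` setting named above.
SET datafusion.execution.collect_statistics = true;

-- Assumed full key for `prefer_existing_sort`: prefer an existing ordering
-- over re-sorting, at the cost of reduced parallelism.
SET datafusion.optimizer.prefer_existing_sort = true;

-- Hypothetical query: inspect the plan to see whether a sort is still present
-- and how many file groups the Parquet scan uses.
EXPLAIN SELECT * FROM my_table ORDER BY ts;
```

Comparing the `EXPLAIN` output with and without these settings should show whether the sort is removed and whether the scan has collapsed into a single file group.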
