Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2256432095 Sorry for the delay @leoyvens and thank you for this analysis > https://github.com/apache/datafusion/issues/11170 I would personally love to take this approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
leoyvens commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064 One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting `execution.collect_statistics = true`, which makes sense. However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (`ParquetScan` with a single file group), which is a big performance regression. So it didn't really help me, maybe I actually want #10316. The consequence to this issue being that turning this on by default would regress performance for users that have `execution.collect_statistics = true`. Maybe the flag should be merged with `prefer_existing_sort`, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so we can both avoid the sort and still have a parallel `ParquetExec`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2094127979 THank you @yyy1000 🙏 I think a good place to start would be to write some sqllogic level tests to cover the important cases Perhaos for the first test: 1. Create files: file1.parquet, file2.parquet both sorted on `a` but file 1 has the columns in the order `a, b, c` and file has the columns in the order `c, b, a`. The keyranges of values of a should be non overlapping 2. Create an external table `a, b, c` with explicit order by `a,` and then query `SELECT ... ORDER BY a` and make sure the output plan doesn't use sort preserving merge I think we could extend datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
yyy1000 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2093968410 I'd like to help it. 🙌 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2089121776 Example test coverage we should add I think: https://github.com/apache/datafusion/pull/9593#discussion_r1585517605 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb opened a new issue, #10336: URL: https://github.com/apache/datafusion/issues/10336 ### Is your feature request related to a problem or challenge? Part of https://github.com/apache/datafusion/issues/10313 In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap This feature is behind a feature flag, `split_file_groups_by_statistics` which defaults to `false` as I think there needs to be some more tests in place before we turn it on ### Describe the solution you'd like Add additional tests and then enable `split_file_groups_by_statistics` by default ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org