adriangb opened a new pull request, #22439: URL: https://github.com/apache/datafusion/pull/22439
## Summary

`repartition_file_min_size` gates how aggressively `repartitioned()` splits file groups by byte range to fan a scan out across `target_partitions` worth of cores.

At 10 MiB the default leaves several SF1-sized dimension tables (TPC-H `part` ≈ 24 MiB, TPC-DS `customer_address` ≈ 7 MiB, …) on a single partition, so any CPU-bound per-batch work in the scan (filter eval, dictionary expansion, etc.) is single-threaded even when the cluster has plenty of idle cores. At 1 MiB those same files split cleanly into `target_partitions` byte ranges.

The cost (more `open()` calls, more metadata loads) is small in absolute terms (≤10 extra opens per file in the worst case, each amortised over the row-group / page-index reads) and the existing knob is still available for workloads where it matters.

## Benchmark numbers

12-core, SF1, with the existing dynamic-filter-pushdown defaults preserved:

| Suite | default (10 MiB) | with this PR (1 MiB) |
|---|---|---|
| TPC-H total | 841 ms | 776 ms |
| TPC-H Q22 | ~30 ms | ~17 ms |
| TPC-DS total | 11.0 s | 11.1 s |
| ClickBench total | 21.7 s | 19.0 s |

## Test plan

- [x] `cargo test --test sqllogictests`: all 472 files pass after the information_schema snapshot and a csv_files reset.
- [ ] `run benchmarks`
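
For readers unfamiliar with the threshold described in the Summary, here is a minimal, self-contained sketch of how a minimum-size gate interacts with byte-range splitting. It is a simplified model and not DataFusion's actual `repartitioned()` code: the function `plan_byte_ranges`, its parameters, and the single-file assumption are all hypothetical, and the real implementation works on file groups and may apply the threshold differently.

```rust
/// One mebibyte, used to express thresholds the same way the PR does.
const MIB: u64 = 1024 * 1024;

/// Hypothetical model of size-gated byte-range splitting for a single file.
///
/// Returns half-open `(start, end)` byte ranges. If the file is smaller than
/// `min_size` (or there is nothing to fan out to), the whole file stays in
/// one range, i.e. one partition.
fn plan_byte_ranges(file_size: u64, target_partitions: u64, min_size: u64) -> Vec<(u64, u64)> {
    // Gate: small files are not split at all.
    if file_size == 0 || target_partitions <= 1 || file_size < min_size {
        return vec![(0, file_size)];
    }

    // Ceiling division so that at most `target_partitions` ranges are produced.
    let chunk = (file_size + target_partitions - 1) / target_partitions;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

In this model, a 7 MiB file scanned with 12 target partitions stays in a single range when the threshold is 10 MiB and is cut into 12 contiguous ranges when the threshold is 1 MiB, which is the kind of fan-out the change is aiming for.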
