asolimando commented on PR #21815: URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4366943795
Hey @xudong963, I've pushed new commits implementing what we discussed. I force-pushed to rebase on latest main, but the first two commits (`f36ef32`, `12a2fc1`) are unchanged from the previous push.

A walkthrough of the commits:

- `f36ef32` adds a `StatisticsContext` parameter to `partition_statistics`, keeping the old method as deprecated, as required by the [API health guidelines](https://datafusion.apache.org/contributor-guide/api-health.html)
- `b380893` adds `partition_statistics_with_context` as the new entry point
- `bb09951` adds `StatsCache` to `StatisticsContext`, shared across the entire `compute_statistics` walk
- `2f843ef` adds a Criterion micro-benchmark on two plan shapes from #19795:
  * `CoalescePartitionsExec` chain (depth 50): ~25x speedup over the non-shared-cache baseline
  * `CrossJoinExec` binary tree (depth 7, 128 leaves): ~3x speedup, mirrors `physical_many_self_joins` from `sql_planner.rs`
- `a8a3d6c` addresses the wasted partition forwarding: `compute_statistics_inner` now always pre-computes children with `partition=None`. Partition-preserving operators request per-partition stats on demand via `compute_child_statistics`, so partition-merging operators use `child_stats()` directly instead of triggering re-walks

Re. the benchmark: the numbers are the average of 5 local runs, and they are conservative, because the baseline still benefits from an ephemeral per-walk cache within each re-walk. The true baseline would be no caching at all, which would show a larger gap. Since this benchmark is new, I couldn't find a better way to show a before/after run. The improvement is clear anyway, but I wanted to mention it for completeness.

I will open a follow-up issue for cross-call caching with stable node IDs (Option 2) once this lands: `StatsCache` does not exist anywhere at the moment, so I am afraid the issue would be confusing if filed now.

Looking forward to your review!
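For readers following the thread, here is a minimal sketch of why sharing one cache across statistics requests helps on the chain shape. It is an illustrative toy under stated assumptions, not the PR's code: `PlanNode`, `WalkContext`, `row_count`, `chain`, and `visits` are hypothetical names, the only "statistic" is a row count, and partitions are not modelled at all.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for a physical plan node (not DataFusion's `ExecutionPlan`).
struct PlanNode {
    leaf_rows: u64, // only meaningful for leaves
    children: Vec<Rc<PlanNode>>,
}

/// Hypothetical context; `cache` plays the role the comment assigns to `StatsCache`.
#[derive(Default)]
struct WalkContext {
    cache: HashMap<*const PlanNode, u64>, // keyed by node identity
    computed: usize,                      // nodes actually computed (cache misses)
}

/// Memoized bottom-up row count: each node is computed at most once per context.
fn row_count(node: &Rc<PlanNode>, ctx: &mut WalkContext) -> u64 {
    let key = Rc::as_ptr(node);
    if let Some(&hit) = ctx.cache.get(&key) {
        return hit;
    }
    ctx.computed += 1;
    let rows = if node.children.is_empty() {
        node.leaf_rows
    } else {
        // Toy rule: multiply the children's row counts.
        node.children.iter().map(|c| row_count(c, ctx)).product()
    };
    ctx.cache.insert(key, rows);
    rows
}

/// Builds a single-child chain of `depth` nodes, returned bottom-up.
fn chain(depth: usize) -> Vec<Rc<PlanNode>> {
    let mut nodes = vec![Rc::new(PlanNode { leaf_rows: 10, children: vec![] })];
    while nodes.len() < depth {
        let child = Rc::clone(nodes.last().unwrap());
        nodes.push(Rc::new(PlanNode { leaf_rows: 0, children: vec![child] }));
    }
    nodes
}

/// Requests statistics at every node, as a pass visiting each operator would,
/// and returns how many node computations that took in total.
fn visits(nodes: &[Rc<PlanNode>], share: bool) -> usize {
    let mut shared = WalkContext::default();
    let mut total = 0;
    for node in nodes {
        if share {
            row_count(node, &mut shared);
        } else {
            // A fresh context per request forgets everything between requests.
            let mut fresh = WalkContext::default();
            row_count(node, &mut fresh);
            total += fresh.computed;
        }
    }
    total + shared.computed
}

fn main() {
    let nodes = chain(50);
    println!("fresh context per request: {}", visits(&nodes, false));
    println!("one shared context:        {}", visits(&nodes, true));
}
```

On the 50-node chain the fresh-context variant performs 1 + 2 + ... + 50 = 1275 node computations, while the shared context performs 50. The toy only shows the quadratic-versus-linear effect; it makes no claim about the measured speedups quoted above.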
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
