asolimando commented on PR #21815:
URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4366943795

   Hey @xudong963, I've pushed new commits implementing what we discussed. 
I force-pushed to rebase on the latest main, but the first two commits 
(`f36ef32`, `12a2fc1`) are unchanged from the previous push.
   
   A walkthrough of the commits:
   - `f36ef32` adds a `StatisticsContext` parameter to `partition_statistics`, 
keeping the old method but marking it deprecated, as required by the [API health 
guidelines](https://datafusion.apache.org/contributor-guide/api-health.html)
   - `b380893` adds `partition_statistics_with_context` as the new entry point
   - `bb09951` adds a `StatsCache` to `StatisticsContext`, shared across the 
entire `compute_statistics` walk
   - `2f843ef` adds a Criterion micro-benchmark on two plan shapes from #19795:
     * CoalescePartitionsExec chain (depth 50): ~25x speedup over the 
non-shared-cache baseline
     * CrossJoinExec binary tree (depth 7, 128 leaves): ~3x speedup; mirrors 
`physical_many_self_joins` from `sql_planner.rs`
   - `a8a3d6c` addresses the wasted partition forwarding: 
`compute_statistics_inner` now always pre-computes children with 
`partition=None`. Partition-preserving operators request per-partition stats on 
demand via `compute_child_statistics`, so partition-merging operators use 
`child_stats()` directly instead of triggering re-walks
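
   To make the effect of a cache shared across the whole walk concrete, here 
is a toy model. It is only a sketch and not the code in this PR: `Node`, `Walk`, 
`rows`, and `count_computed` are hypothetical names, the plan is a plain chain 
stored in a slice, and "statistics" is reduced to a single row count. It 
assumes statistics are requested once per node and that, without a shared 
cache, each request recomputes the node's whole subtree.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a plan node; children are indices into the plan slice.
struct Node {
    leaf_rows: u64,
    children: Vec<usize>,
}

/// Hypothetical walker. `cache: None` models recomputing from scratch on every
/// request; `Some(..)` models one cache shared across all requests.
struct Walk<'a> {
    plan: &'a [Node],
    cache: Option<HashMap<usize, u64>>,
    computed: usize,
}

impl Walk<'_> {
    /// Estimated row count for node `id`; bumps `computed` on every non-cached visit.
    fn rows(&mut self, id: usize) -> u64 {
        if let Some(&hit) = self.cache.as_ref().and_then(|c| c.get(&id)) {
            return hit;
        }
        self.computed += 1;
        // Copy the shared reference so `self` can be borrowed mutably below.
        let plan = self.plan;
        let node = &plan[id];
        let rows = if node.children.is_empty() {
            node.leaf_rows
        } else {
            // Toy rule: an inner node's row count is the product of its children's.
            let mut product = 1u64;
            for &child in &node.children {
                product = product.saturating_mul(self.rows(child));
            }
            product
        };
        if let Some(cache) = self.cache.as_mut() {
            cache.insert(id, rows);
        }
        rows
    }
}

/// Builds a chain of `depth` nodes (node 0 is the leaf), requests statistics
/// once per node, and returns how many node computations that took.
fn count_computed(depth: usize, shared_cache: bool) -> usize {
    let plan: Vec<Node> = (0..depth)
        .map(|i| Node {
            leaf_rows: 100,
            children: if i == 0 { vec![] } else { vec![i - 1] },
        })
        .collect();
    let mut walk = Walk {
        plan: &plan,
        cache: if shared_cache { Some(HashMap::new()) } else { None },
        computed: 0,
    };
    for id in 0..depth {
        walk.rows(id);
    }
    walk.computed
}

fn main() {
    println!("no shared cache: {}", count_computed(50, false));
    println!("shared cache:    {}", count_computed(50, true));
}
```

   In this model a depth-50 chain needs 1275 node computations without the 
shared cache and 50 with it. These are visit counts in a simplified model, not 
the measured benchmark speedups.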
   
   Re. the benchmark: the numbers are the average of 5 local runs, and they 
are conservative, because the baseline still benefits from an ephemeral 
per-walk cache within each re-walk. The true baseline would be no caching at 
all, which would show a larger gap. Since this benchmark is new, I couldn't 
find a better way to show a before/after run. The improvement is clear anyway, 
but I wanted to mention it for completeness.
   
   I'll open a follow-up issue for cross-call caching with stable node IDs 
(Option 2) once this lands. Since `StatsCache` doesn't exist anywhere yet, I'm 
afraid the issue would be confusing if filed now.
   
   Looking forward to your review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

