kosiew opened a new pull request, #19476:
URL: https://github.com/apache/datafusion/pull/19476
## Which issue does this PR close?
* Closes #18989.
## Rationale for this change
A panic could occur during `SanityCheckPlan` when optimizing / executing
queries that chain multiple aggregations over a multi-partitioned input (often
empty tables), e.g. `Sort -> Aggregate(ts, region) -> Sort -> Aggregate(ts)`.
In these cases the first aggregation can produce a hash partitioning that is
a **superset** of the second aggregation’s required distribution keys. The
optimizer previously treated some of these situations as not needing a hash
repartition (especially when the input was empty / stats suggested repartition
wasn’t beneficial), which could leave the plan violating the downstream
operator’s distribution requirement and trigger `SanityCheckPlan` failures.
This PR aligns the optimizer’s distribution enforcement decisions with the
same partitioning satisfaction logic used by the sanity checker, ensuring that
required repartitions are inserted whenever the downstream requirement is not
satisfied.
## What changes are included in this PR?
* Introduced `DistributionSatisfactionResult` and a shared
`distribution_satisfaction(...)` helper in `enforce_distribution` to compute
and carry:
* the required `Distribution`
* the child’s `output_partitioning`
* the computed `PartitioningSatisfaction` (Exact / Subset / NotSatisfied)
* a single, reusable `requires_repartition(...)` decision function.
* Updated `add_hash_on_top` and `ensure_distribution` to use the shared
satisfaction result and decision logic, rather than ad-hoc checks.
* Simplified `get_repartition_requirement_status` by removing alignment
logic based on `hash_necessary` flags and instead deferring to the
satisfaction-based repartition decision.
* Updated `SanityCheckPlan` to reuse the same
`distribution_satisfaction(...)` helper (so optimizer and sanity checker
evaluate distribution satisfaction consistently).
* Re-exported `PartitioningSatisfaction` from `datafusion_physical_expr` for
use in tests and downstream code.
* Added targeted regression tests:
* Unit tests verifying `PartitioningSatisfaction` outcomes (Exact / Subset
/ NotSatisfied) match sanity checker behavior for hash partitioning.
* A tokio regression test reproducing the chained-aggregate scenario from
#18989 and asserting the optimized plan either inserts a repartition between
aggregates or preserves a single-partition stream via
`SortPreservingMergeExec`, and executes without panic.
* Updated sqllogictest expected plans where distribution enforcement now
inserts `RepartitionExec` and/or preserves partitioning through `SortExec`.
## Are these changes tested?
Yes.
* Added new unit tests in
`datafusion/core/tests/physical_optimizer/enforce_distribution.rs` to validate
distribution satisfaction behavior and its alignment with `SanityCheckPlan`.
* Added an async regression test covering the multi-partition empty table +
chained aggregations scenario from #18989.
* Updated sqllogictest outputs to reflect the new physical plans.
Performance sanity checks:
* Ran the following benchmarks and verified no performance regressions:
* `cargo bench --bench topk_aggregate --profile=profiling --
--save-baseline before`
* `cargo bench --bench distinct_query_sql --profile=profiling --
--save-baseline before`
* `cargo bench --bench sort_preserving_merge --profile=profiling --
--save-baseline before`
## Are there any user-facing changes?
Potentially.
* Some queries may now include additional `RepartitionExec` operators (or
preserve partitioning through sorts) to correctly satisfy downstream
distribution requirements. This can change the displayed physical plan (EXPLAIN
output) and may affect performance characteristics in edge cases, but fixes a
correctness/stability issue (prevents optimizer-time / execution-time panics).
* No SQL syntax or API behavior changes are intended.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]