peter-toth opened a new pull request, #55659: URL: https://github.com/apache/spark/pull/55659
### What changes were proposed in this pull request? Four bug fixes and two small cleanups in `PlanMerger`: Bug fixes in `PlanMerger`: 1. Tagged `(Filter, Filter)` reuse preserves `mergedChild`'s appended columns: When the reuse check finds an existing `propagatedFilter` alias, the branch now rebuilds the Filter over `mergedChild` (via `cp.withNewChildren(Seq(mergedChild))`) instead of returning `cp` unchanged. If the recursion extended `cp.child`'s output with new columns (e.g. a computed `d = a + b` from a user Project below the Filter), returning `cp` would drop those columns while `npMapping` still pointed into them, leaving the enclosing `Aggregate` with unresolved references. 2. `(np: Filter, cp)` create-new does not re-append `cpFilter`: `cpFilter`, when set, was produced by a deeper `(np, cp: Filter)` (or `(Join, Join)` pass-through) and is already part of `mergedChild`'s output. Appending it a second time via `++ cpFilter.toSeq` duplicated the attribute in the outer Project's projectList. 3. `(np, cp: Filter)` create-new does not re-append `npFilter`: Symmetric to 2. on the np side. 4. `(np, cp: Filter)` with a `MERGED_FILTER_TAG`-tagged `cp` drops the tagged Filter: `cp`'s condition is `OR(pf_0, pf_1, ...)` and `cp`'s aggregate expressions already carry individual `FILTER (WHERE pf_i)` clauses. Synthesising a new `propagatedFilter_X = OR(pf_0, pf_1, ...)` would just add `FILTER AND(OR(...), pf_i)` wrapping upstream (simplifying to `FILTER pf_i`) plus a redundant alias in the Project. The branch now drops `cp`'s Filter and returns `cpFilter = None` so `cp`'s aggregates are left untouched. Cleanups in `PlanMerger.merge()`: - Unify the local variable name to `newMergedPlan` across all three branches (was `newMergedPlan` in one and `newMergePlan` in the other two) -- matches the `MergedPlan` case class name. - Replace `cache(i).merged` with `mp.merged`; `mp` and `cache(i)` are the same object inside the `collectFirst` pattern. ### Why are the changes needed? Fix 1. is a correctness bug. Fixes 2.-4. are plan-shape bugs that produce duplicated attributes or redundant `OR`-of-propagated-filter aliases in the merged plan. The cleanups are minor readability improvements. ### Does this PR introduce _any_ user-facing change? No. All changes are internal to the optimizer; they produce cleaner merged plans for queries that `MergeSubplans` already handled. ### How was this patch tested? Four new tests in `MergeSubplansSuite`, one per fix: - `(np: Filter, cp)` create-new must not re-append cpFilter into the Project -- exercises 2. via a `Join` with a `Filter` on the right child, routing a cpFilter up through `(Join, Join)` so that `mergedChild.output` already contains the attribute the branch used to re-append. - `(np, cp: Filter)` create-new must not re-append npFilter into the Project -- exercises 3., mirror shape on the np side. - tagged `(Filter, Filter)` reuse must keep mergedChild's appended columns -- exercises 1. with three subqueries (sq1/sq2 create the tagged structure; sq3's Filter sits above a user Project introducing `d = a + b`, so the `(Filter, Filter)` tagged recursion extends `mergedChild` with `d`). - `(np, cp: Filter)` drops a tagged cp Filter without synthesising a redundant alias -- exercises 4. with three subqueries (sq1/sq2 create the tagged structure; sq3 has no filter). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
