[PR] [SPARK-40193][SQL][FOLLOWUP] Restrict cached-side If wrapping to original cached range [spark]

via GitHub Wed, 22 Apr 2026 22:35:33 -0700


cloud-fan opened a new pull request, #55500:
URL: https://github.com/apache/spark/pull/55500


   ### What changes were proposed in this pull request?
   
   Follow-up to #55298 (SPARK-40193). Two related cleanups to `PlanMerger`'s 
filter propagation:
   
   1. **Correctness fix in `mergeNamedExpressions`.** Wrapping of unmatched 
cached expressions with the cached plan's filter now iterates only over the 
original cached range `[0, cachedPlanExpressions.length)`, not over all of 
`mergedExpressions`. The previous loop also touched new-plan entries that were 
appended earlier in the same call and already wrapped with the new plan's 
filter.
   
   2. **Tighten the `(np: Filter, cp)` / `(np, cp: Filter)` cases in 
`tryMergePlans`.** Drop the structurally unreachable branches that appended 
`cpFilter.toSeq` / `npFilter.map(_._1).toSeq` to the new `Project`, along with 
the corresponding `symmetricFilterPropagationEnabled` escape in the guard. In 
both cases the recursion keeps the non-Filter side unchanged, so no deeper case 
can expose a `Filter` on that side: the child result always has 
`cpFilter = None` / `npFilter = None`. Matching `None` explicitly documents 
that invariant and removes dead code that would have produced a `Project` with 
duplicate attributes if ever reached.
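   
   To illustrate item 1, here is a minimal sketch of the range restriction. It 
uses toy stand-ins, not Catalyst classes: `Leaf`, `IfWrap`, `wrapAll`, and 
`wrapCachedRange` are hypothetical names chosen for this example, and the real 
`mergeNamedExpressions` is more involved.
   
   ```scala
   object WrapRangeSketch {
     // Toy stand-ins for Catalyst expressions; these are NOT Spark classes.
     sealed trait Expr
     final case class Leaf(name: String) extends Expr
     // Models If(filter, child, null).
     final case class IfWrap(filter: String, child: Expr) extends Expr

     // Old shape: every unmatched slot is wrapped, including entries that were
     // appended for the new plan earlier in the same call.
     def wrapAll(merged: Vector[Expr], matched: Set[Int], cpFilter: String): Vector[Expr] =
       merged.zipWithIndex.map { case (e, i) =>
         if (matched(i)) e else IfWrap(cpFilter, e)
       }

     // New shape: only indices in [0, cachedLen) are candidates for wrapping.
     def wrapCachedRange(
         merged: Vector[Expr],
         cachedLen: Int,
         matched: Set[Int],
         cpFilter: String): Vector[Expr] =
       merged.zipWithIndex.map {
         case (e, i) if i < cachedLen && !matched(i) => IfWrap(cpFilter, e)
         case (e, _) => e
       }

     // One cached expression, plus one new-plan expression that was already
     // wrapped with the new plan's filter when it was appended.
     val cached: Vector[Expr] = Vector(Leaf("x"))
     val appended: Vector[Expr] = Vector(IfWrap("npFilter", Leaf("y")))
     val merged: Vector[Expr] = cached ++ appended

     val oldResult: Vector[Expr] = wrapAll(merged, Set.empty, "cpFilter")
     val newResult: Vector[Expr] =
       wrapCachedRange(merged, cached.length, Set.empty, "cpFilter")

     def main(args: Array[String]): Unit = {
       println(oldResult)
       println(newResult)
     }
   }
   ```
   
   In this toy model the old loop double-wraps the appended entry, while the 
restricted loop leaves it with its single new-plan wrap.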
   
   ### Why are the changes needed?
   
   For (1): with symmetric filter propagation enabled 
(`spark.sql.optimizer.mergeSubplans.symmetricFilterPropagation.enabled = true`) 
and non-attribute `Project` expressions on both sides of the merge, the 
cached-side loop double-wrapped new-plan-appended expressions with 
`If(cpFilter, If(npFilter, expr, null), null)` and replaced the slot in 
`mergedExpressions` with a new `Alias` (fresh `exprId`). The `newNPMapping` 
built earlier in the same call still pointed at the single-wrap alias's 
attribute, so the parent `Aggregate` was rewritten to reference an attribute 
that was no longer in the merged `Project`'s output. The resulting plan failed 
analysis with `MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`.
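   
   The stale reference described above can be modeled in a few lines. This is a 
simplified sketch, not Spark code: `Alias` here is a toy case class, and 
`freshAlias`, `mappingTarget`, and `projectOutputIds` are hypothetical names 
standing in for Catalyst's `Alias`, `newNPMapping`, and the merged `Project`'s 
output.
   
   ```scala
   object StaleMappingSketch {
     // Toy alias: each new alias receives a fresh id, mimicking a fresh exprId.
     final case class Alias(expr: String, name: String, exprId: Long)

     private var nextId = 0L
     def freshAlias(expr: String, name: String): Alias = {
       nextId += 1
       Alias(expr, name, nextId)
     }

     // Step 1: the new-plan expression is appended with a single wrap, and the
     // mapping used to rewrite the parent records that alias's id.
     val singleWrap: Alias = freshAlias("If(npFilter, a + 1, null)", "y")
     val mappingTarget: Long = singleWrap.exprId

     // Step 2 (the bug): the cached-side loop wraps the same slot again and
     // replaces it with a new alias, which gets a different id.
     val doubleWrap: Alias =
       freshAlias(s"If(cpFilter, ${singleWrap.expr}, null)", "y")

     // The merged Project now only outputs the replacement alias.
     val projectOutputIds: Set[Long] = Set(doubleWrap.exprId)

     // The parent still refers to the id recorded in step 1.
     val parentResolves: Boolean = projectOutputIds.contains(mappingTarget)

     def main(args: Array[String]): Unit =
       println(s"parent reference resolves: $parentResolves")
   }
   ```
   
   Both aliases share the name `y` but carry different ids, which is why the 
error in the reproducer reports `"y"` as missing from an output that also 
appears to contain `"y"`.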
   
   Minimal reproducer (fails on master before this PR):
   
   ```scala
   withSQLConf(SQLConf.MERGE_SUBPLANS_SYMMETRIC_FILTER_PROPAGATION_ENABLED.key -> "true") {
     val subquery1 = ScalarSubquery(
       testRelation.where($"a" > 1).select(($"a" * 2).as("x")).groupBy()(sum($"x").as("sum_x")))
     val subquery2 = ScalarSubquery(
       testRelation.where($"a" < 1).select(($"a" + 1).as("y")).groupBy()(max($"y").as("max_y")))
     val df = testRelation.select(subquery1, subquery2).analyze
     // analyzer error: Resolved attribute(s) "y" missing from "x", "y", ...
     MergeSubplans(df)
   }
   ```
   
   For (2): the branches in question are unreachable by case analysis and the 
appended `cpFilter.toSeq` / `npFilter.map(_._1).toSeq` would duplicate an 
attribute already present in `mergedChild.output`. Removing them makes the 
reachable contract explicit.
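   
   As a rough illustration of what matching `None` explicitly looks like, the 
sketch below models the `(np: Filter, cp)` case with toy types. `ChildMerge` 
and `mergeNewSideFilter` are hypothetical, and the fallback of declining to 
merge is an assumption of this sketch rather than a description of 
`tryMergePlans`.
   
   ```scala
   object ExplicitNoneSketch {
     // Hypothetical result of merging the children; not Spark's real type.
     final case class ChildMerge(
         plan: String,
         cpFilter: Option[String],
         npFilter: Option[String])

     // Models the (np: Filter, cp) case: the cached side is passed down
     // unchanged, so a cached-side filter is expected to be absent.
     def mergeNewSideFilter(child: ChildMerge): Option[String] = child match {
       // Only the new-plan filter is projected; nothing is appended for cpFilter.
       case ChildMerge(plan, None, Some(np)) => Some(s"Project($np, $plan)")
       case ChildMerge(plan, None, None) => Some(plan)
       // Assumption of this sketch: an unexpected cached-side filter means
       // the plans are not merged.
       case _ => None
     }

     def main(args: Array[String]): Unit =
       println(mergeNewSideFilter(ChildMerge("Scan", None, Some("npFilter"))))
   }
   ```
   
   Writing the pattern this way puts the expected shape of the child result in 
the match itself instead of in a branch that handles a state that cannot occur.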
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The bug was only observable as an analyzer failure, and only when 
`spark.sql.optimizer.mergeSubplans.symmetricFilterPropagation.enabled` (which 
defaults to `false`) was enabled together with subqueries whose merge path 
exercises non-attribute `Project` expressions on both sides. Behavior is 
otherwise unchanged from master.
   
   ### How was this patch tested?
   
   - New unit test `MergeSubplansSuite`: `"SPARK-40193: Merge non-grouping 
subqueries with different filter conditions and non-attribute Project 
expressions on both sides"` — fails on master without the fix (analysis error), 
passes with the fix.
   - Full `MergeSubplansSuite` (42 tests) and `PlanMergeSuite` (12 tests) 
continue to pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Sonnet 4.5
   
   This pull request and its description were written by Isaac.

