cloud-fan opened a new pull request, #55471:
URL: https://github.com/apache/spark/pull/55471

   ### What changes were proposed in this pull request?
   
   `FilterExec.doConsume` (after SPARK-56032 + SPARK-56431) emits all CSE 
precomputation at the top of the `do { } while (false);` block, before any 
`otherPred` runs. If a common subexpression is used only by a later 
`otherPred`, it is still hoisted ahead of every earlier predicate's `continue;` 
short-circuit. For expressions that throw on invalid input, that turns rows an 
earlier predicate would have rejected into thrown exceptions.
   
   This PR moves each common subexpression's precompute to just before the 
first `otherPred` that references it. Later `otherPred`s still reuse the cached 
value; earlier `otherPred`s pay no cost for subexpressions they do not use. 
`evaluateSubExprEliminationState` already emits each state's children 
recursively and marks emitted code as `EmptyBlock`, so cross-group dependencies 
(a state in group `i` whose child lives in group `j < i`) fold into the earlier 
emission and later emissions of the already-emitted child are no-ops.
   
   With the reorder, binding `otherPreds` against the tightened `output` is 
again safe: all `IsNotNull` short-circuits are emitted upfront, so 
`notNullAttributes` columns are guaranteed non-null by the time any `otherPred` 
(or any CSE precompute) runs. This supersedes the SPARK-56431 `child.output` 
binding workaround.
   
   ### Why are the changes needed?
   
   Regression in throw-prone predicates. For example:
   
   ```sql
   SELECT COUNT(*) FROM t
   WHERE x > 0 AND ((100 / x) > 0 OR (100 / x) < 1000)
   ```
   
   Under ANSI, `100 / 0` throws `DIVIDE_BY_ZERO`. Pre-fix, the hoisted CSE 
precompute of `100 / x` runs for every row before `x > 0` has a chance to 
reject `x = 0`, so the query throws instead of returning a count. Other CSE'd 
throw-prone expressions (`to_utc_timestamp` on extreme timestamps, datetime 
overflow in ANSI mode, geospatial functions on the wrong geometry type in 
downstream consumers, etc.) hit the same pattern.
   
   Other operators with CSE (`ProjectExec`, `HashAggregateExec`, 
`AggregateCodegenSupport`) are unaffected because none of them short-circuit 
between expressions with `continue;` — every input row flows through every 
expression, so hoisting is semantically transparent.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The generated code is functionally equivalent to the non-CSE path: no 
row has any predicate evaluated that a prior `continue;` would have skipped.
   
   ### How was this patch tested?
   
   - Existing `WholeStageCodegenSuite` "SPARK-56032: subexpression elimination 
in FilterExec codegen" (CSE savings on `a + b`): passes.
   - Existing `WholeStageCodegenSuite` "SPARK-56431: CSE in FilterExec 
preserves null guards for notNull columns" (`Decimal` NPE repro): passes.
   - New `WholeStageCodegenSuite` "SPARK-56032: FilterExec CSE preserves 
short-circuit across predicates": uses `x > 0 AND ((100 / x) > 0 OR (100 / x) < 
1000)` under ANSI; passes with this fix, fails on master without it.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to