gengliangwang commented on PR #55889:
URL: https://github.com/apache/spark/pull/55889#issuecomment-4461538263
Closing this PR — found a regression on `postgreSQL/window_part2.sql` that
revealed the optimization is fundamentally unsafe. Posting the analysis here
for the next attempt.
### The regression
Query #28 expects `CAST_INVALID_INPUT` with `queryContext` at the window
spec (`startIndex: 83, stopIndex: 163`). With this PR's
`transformUpWithPruning` change, the `queryContext` widens to the whole query
(`startIndex: 1, stopIndex: 163`). Same exception, wrong source span.
### Root cause
The optimization replaces:
```scala
val afterRule = CurrentOrigin.withOrigin(origin) {
rule.applyOrElse(this, identity[BaseType])
}
```
with:
```scala
val afterRule = if (rule.isDefinedAt(this)) {
CurrentOrigin.withOrigin(origin) { rule.apply(this) }
} else { this }
```
The intent: skip `withOrigin`'s `ThreadLocal` set/restore when the rule
doesn't match. The flaw: `isDefinedAt` is called **outside** `withOrigin`.
For `AnsiCombinedTypeCoercionRule`, the rule is `Function.unlift { e =>
folds-through-sub-rules }`. Scala's `Unlifted.isDefinedAt(a) = f(a).isDefined`
— it **runs the underlying function**, which constructs new nodes (`Cast(...)`,
`s.copy(...)`). Each new `TreeNode` captures `CurrentOrigin.get` at
construction.
`WindowFrameTypeCoercion` constructs `Cast(Literal('NaN'), IntegerType)`
during this probe. With my change the Cast is constructed *outside*
`withOrigin`, so its `origin` field captures the outer Project's origin (the
whole query) instead of the WindowSpec's origin.
Spark eagerly evaluates window frame bound expressions during analysis (they
must be foldable). `Cast.eval` on `'NaN' → INT` throws
`SparkNumberFormatException` with `context = cast.origin.context = (1, 163)` —
the wrong span. The exception escapes `isDefinedAt` carrying the wrong context.
The user sees an error pointing at the whole query.
### Trace evidence
```
[TRACE] this.origin=(82,162) CurrentOrigin=(0,162)
[THREW from isDefinedAt] SparkNumberFormatException: CAST_INVALID_INPUT
The value 'NaN' of the type "STRING" cannot be cast
to "INT"
```
### Why catalyst tests didn't catch this
The exception itself is thrown either way — only the `queryContext` attached
to it differs. Catalyst's 9272 tests check behavior; `SQLQueryTestSuite`'s
golden files check exact error-message positions.
### Why this optimization is fundamentally unsafe
Wherever a `PartialFunction.isDefinedAt` has side effects that construct
`TreeNode`s — and any unlift'd PF guarantees this — the optimization captures
the wrong `CurrentOrigin` for nodes constructed during the probe. Same
theoretical risk applies to `transformDownWithPruning` (no failing test
surfaced, but the structure is identical).
The fix patterns I considered all either:
- Wrap `isDefinedAt` in `withOrigin` too — defeats the optimization
(equivalent to original).
- Use a sentinel default with `applyOrElse` inside `withOrigin` — same.
- Require PFs to opt in via a marker trait promising pure `isDefinedAt` —
invasive, doesn't fix `CombinedTypeCoercionRule`.
None gains the perf and stays correct.
### What's worth keeping from the investigation
JFR profile on a representative `transformDown` (1024-leaf balanced `Add`
tree, non-matching rule, 60s sample): **~88% of CPU in
`ThreadLocal$ThreadLocalMap.set` / `getEntryAfterMiss`** inside
`CurrentOrigin.withOrigin`. There's real money on the floor. Capturing it
likely requires one of:
1. A different `CurrentOrigin` representation that avoids per-node
`ThreadLocal` cost.
2. Moving `CurrentOrigin` propagation off the per-node hot path (set once
per rule batch).
3. Refactoring `CombinedTypeCoercionRule` so its `isDefinedAt` is pure —
making the `isDefinedAt`-then-`apply` pattern safe.
JIRA SPARK-56869 will stay open for the next attempt.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]