gengliangwang commented on PR #55889:
URL: https://github.com/apache/spark/pull/55889#issuecomment-4461538263

   Closing this PR — found a regression on `postgreSQL/window_part2.sql` that 
revealed the optimization is fundamentally unsafe. Posting the analysis here 
for the next attempt.
   
   ### The regression
   
   Query #28 expects `CAST_INVALID_INPUT` with `queryContext` at the window 
spec (`startIndex: 83, stopIndex: 163`). With this PR's 
`transformUpWithPruning` change, the `queryContext` widens to the whole query 
(`startIndex: 1, stopIndex: 163`). Same exception, wrong source span.
   
   ### Root cause
   
   The optimization replaces:
   ```scala
   val afterRule = CurrentOrigin.withOrigin(origin) {
     rule.applyOrElse(this, identity[BaseType])
   }
   ```
   with:
   ```scala
   val afterRule = if (rule.isDefinedAt(this)) {
     CurrentOrigin.withOrigin(origin) { rule.apply(this) }
   } else { this }
   ```
   
   The intent: skip `withOrigin`'s `ThreadLocal` set/restore when the rule 
doesn't match. The flaw: `isDefinedAt` is called **outside** `withOrigin`.
   
   For `AnsiCombinedTypeCoercionRule`, the rule is `Function.unlift { e => 
folds-through-sub-rules }`. Scala's `Unlifted.isDefinedAt(a) = f(a).isDefined` 
— it **runs the underlying function**, which constructs new nodes (`Cast(...)`, 
`s.copy(...)`). Each new `TreeNode` captures `CurrentOrigin.get` at 
construction.
   
   `WindowFrameTypeCoercion` constructs `Cast(Literal('NaN'), IntegerType)` 
during this probe. With my change the Cast is constructed *outside* 
`withOrigin`, so its `origin` field captures the outer Project's origin (the 
whole query) instead of the WindowSpec's origin.
   
   Spark eagerly evaluates window frame bound expressions during analysis (they 
must be foldable). `Cast.eval` on `'NaN' → INT` throws 
`SparkNumberFormatException` with `context = cast.origin.context = (1, 163)` — 
the wrong span. The exception escapes `isDefinedAt` carrying the wrong context. 
The user sees an error pointing at the whole query.
   
   ### Trace evidence
   ```
   [TRACE]   this.origin=(82,162)  CurrentOrigin=(0,162)
   [THREW from isDefinedAt] SparkNumberFormatException: CAST_INVALID_INPUT
                            The value 'NaN' of the type "STRING" cannot be cast 
to "INT"
   ```
   
   ### Why catalyst tests didn't catch this
   
   The exception itself is thrown either way — only the `queryContext` attached 
to it differs. Catalyst's 9272 tests check behavior; `SQLQueryTestSuite`'s 
golden files check exact error-message positions.
   
   ### Why this optimization is fundamentally unsafe
   
   Wherever a `PartialFunction.isDefinedAt` has side effects that construct 
`TreeNode`s — and any unlift'd PF guarantees this — the optimization captures 
the wrong `CurrentOrigin` for nodes constructed during the probe. Same 
theoretical risk applies to `transformDownWithPruning` (no failing test 
surfaced, but the structure is identical).
   
   The fix patterns I considered all either:
   - Wrap `isDefinedAt` in `withOrigin` too — defeats the optimization 
(equivalent to original).
   - Use a sentinel default with `applyOrElse` inside `withOrigin` — same.
   - Require PFs to opt in via a marker trait promising pure `isDefinedAt` — 
invasive, doesn't fix `CombinedTypeCoercionRule`.
   
   None gains the perf and stays correct.
   
   ### What's worth keeping from the investigation
   
   JFR profile on a representative `transformDown` (1024-leaf balanced `Add` 
tree, non-matching rule, 60s sample): **~88% of CPU in 
`ThreadLocal$ThreadLocalMap.set` / `getEntryAfterMiss`** inside 
`CurrentOrigin.withOrigin`. There's real money on the floor. Capturing it 
likely requires one of:
   
   1. A different `CurrentOrigin` representation that avoids per-node 
`ThreadLocal` cost.
   2. Moving `CurrentOrigin` propagation off the per-node hot path (set once 
per rule batch).
   3. Refactoring `CombinedTypeCoercionRule` so its `isDefinedAt` is pure — 
making the `isDefinedAt`-then-`apply` pattern safe.
   
   JIRA SPARK-56869 will stay open for the next attempt.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to