yaooqinn opened a new pull request, #12183:
URL: https://github.com/apache/gluten/pull/12183
## Motivation
The V2 `serializeWithStats` fast path (opt-in via
`spark.gluten.sql.columnar.tableCache.partitionStats.enabled`) used to catch
only `UnsatisfiedLinkError`. Any `IllegalArgumentException` from
`parseFramedBytes` (corrupt magic, truncated frame, statsLen/bytesLen mismatch)
or `KryoException` from `deserializeStats` would bubble out and **fail the
entire Spark task**. Since V2 is an opt-in optimization, a malformed frame for
one cached batch should not be user-visible — the existing legacy `serialize()`
fallback should absorb it just like an `UnsatisfiedLinkError` does.
## Change
Layered catch in the per-batch fast path:
```scala
case e: UnsatisfiedLinkError =>
// capability gone for the JVM lifetime; trip the latch
markStatsExtUnavailable(e); fallbackToLegacy()
case NonFatal(e) =>
// per-batch corruption; do NOT trip the latch
warnCorruptStatsFrame(e); fallbackToLegacy()
```
The capability latch (`statsExtAvailableFlag`) is deliberately one-way for
the JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame
events, in contrast, are per-batch and shouldn't poison the latch: the next
batch retries the fast path.
To keep log output bounded when a native regression produces malformed
frames batch after batch, `warnCorruptStatsFrame` caps warnings at 100 per JVM
with a final summary line.
## Refactor
To make the catch site unit-testable (previously nested inside an anonymous
`Iterator` lambda inside `mapPartitions`), the fast path was extracted into a
companion-object helper `serializeOneBatchWithStats(jni, handle, numRows,
structSchema, fallbackToLegacy)`. The first commit is the pure refactor; the
second commit adds the new arm + tests.
`resetStatsExtAvailableForTesting()` is dropped — it had no callers after
the new helper suite switched to resetting the flag via reflection directly.
## Test coverage
New `ColumnarCachedBatchSerializerHelperSuite` (4 cases, Mockito-stubbed JNI
wrapper, no native runtime needed):
- corrupt magic → fallback `CachedBatch` with `stats=null`
- truncated framed bytes → fallback
- Kryo-corrupt statsBlob → fallback
- `UnsatisfiedLinkError` → still trips capability latch (regression)
Local sentinel — all 9 `ColumnarCachedBatch*Suite` green, 46/0 succeeded, 0
regression vs apache/main.
## Risks
- Log cap is hard-coded at 100. Happy to move to a `SQLConf` if reviewers
prefer.
- `NonFatal` deliberately does not absorb `OutOfMemoryError` /
`StackOverflowError` / control-flow exceptions — those should still propagate.
Part of the V2 hardening series following #12176.
Generated-by: Claude claude-opus-4.7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]