cloud-fan opened a new pull request, #56054: URL: https://github.com/apache/spark/pull/56054
### What changes were proposed in this pull request? Followup to https://github.com/apache/spark/pull/54373. SPARK-55601 added a new `offsetLog.getLatest()` call inside `logicalPlan`'s computation to derive `enforceNamed` from the last written offset log entry. `initializeExecution` already calls `offsetLog.getLatest()` on its first line. Both calls happen on the query thread during stream startup, with no offset log writes in between, so the two reads always return the same value. The second one is wasted work: each `getLatest()` triggers `listBatches` → `HDFSMetadataLog.list` → a filesystem `ListStatus` on the checkpoint's `offsets/` directory. This PR caches the first read in a `private lazy val initialLatestOffsetSeq` on `MicroBatchExecution` and routes both call sites through it: - `enforceNamed` derivation in `logicalPlan` lazy val. - `var latestStartedBatch` initialization in `initializeExecution`. Subsequent reads inside `initializeExecution` (after a `purgeAfter`) and in `populateStartOffsets` are unchanged — those legitimately need fresh `getLatest()` results. ### Why are the changes needed? Avoids one redundant `ListStatus` on `<checkpoint>/offsets/` per stream startup. The cost is small but unnecessary, and downstream consumers that track per-checkpoint filesystem operations (for tracing, auditing, or test invariants) currently see one extra op against the offsets directory because of this duplication. ### Does this PR introduce _any_ user-facing change? No. Same behavior, fewer filesystem calls. ### How was this patch tested? Existing `MicroBatchExecutionSuite` and downstream streaming-startup tests cover both call sites. The change is a pure caching refactor; the cached value is identical to what a second `getLatest()` would return because nothing else writes the offset log between construction and `initializeExecution` on the query thread. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
