eason-yuchen-liu opened a new pull request, #55676: URL: https://github.com/apache/spark/pull/55676
### What changes were proposed in this pull request? Introduce a shared `AtomicReference[Throwable]` (the first async write error) that is consulted at the start of every async write task in both `AsyncOffsetSeqLog` and `AsyncCommitLog`. Once the first failure is recorded: - Subsequent offset and commit log write tasks short-circuit by failing with the original error without touching durable storage. - The first error is preserved via `compareAndSet(null, err)` so it is not overwritten by later cascading failures. The shared reference is owned by `AsyncProgressTrackingMicroBatchExecution` and threaded through `AsyncStreamingQueryCheckpointMetadata` to both async logs. The existing `.exceptionally` handlers in `markMicroBatchStart` / `markMicroBatchEnd` are routed through a new `recordAsyncWriteError` helper so the shared ref is also populated when the failure originates from `addAsync`'s own `thenApply` (e.g. `concurrentStreamLogUpdate`). ### Why are the changes needed? When async progress tracking is enabled, offset and commit log writes are submitted to a single-threaded executor service. If one async write task fails, follow-up writes already queued (or queued shortly afterward) can still proceed and persist files to durable storage, leaving inconsistent state on disk — for example, an offset entry for batch N missing while batch N+1 is present, or a commit-log entry written without its corresponding offset-log entry. The original error can also be overwritten in the `ErrorNotifier` by a later cascading failure, so the user-visible exception masks the root cause (e.g. surfacing `concurrentStreamLogUpdate` instead of the actual `Permission denied` / IOException that started the cascade). ### Does this PR introduce _any_ user-facing change? No. Async-progress-tracking queries that hit a write failure will now surface the root cause instead of a later cascading error, but the failure mode (query termination with a `StreamingQueryException`) is unchanged. ### How was this patch tested? Added a regression test in `AsyncProgressTrackingMicroBatchExecutionSuite` that triggers a real I/O failure on the first offset write and verifies: 1. The shared first-error reference is populated. 2. A follow-up commit-log write short-circuits with the original error and produces no commit file. 3. A follow-up offset-log write for the next batch also short-circuits with the same first error. 4. The shared error reference is not overwritten by later cascading failures. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.7) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
