Yuchen Liu created SPARK-56720:
----------------------------------
Summary: Fail subsequent async log writes after a prior failure in
async progress tracking
Key: SPARK-56720
URL: https://issues.apache.org/jira/browse/SPARK-56720
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Yuchen Liu
When async progress tracking is enabled, offset and commit log writes are
submitted to a single-threaded executor in {{AsyncOffsetSeqLog}} /
{{AsyncCommitLog}}. If one async write task fails (e.g. an HDFS
{{Permission denied}} or other {{IOException}}), follow-up tasks that are
already queued, or that are queued before the main thread re-checks
{{errorNotifier}} at the next batch boundary, still execute and may
successfully persist files to durable storage. This produces two
correctness/observability problems:
# Gaps on durable storage. The offset log may be missing batch _N_ while batch
_N+1_ is present, or a commit-log entry can be written without its
corresponding offset-log entry. This violates the invariant that the commit log
is a prefix of the offset log on disk.
# Root cause is masked. {{ErrorNotifier.markError}} overwrites previously
stored errors, so a later cascading failure (e.g.
{{concurrentStreamLogUpdate}}) can replace the original {{Permission
denied}}/{{IOException}} and surface as the user-visible
{{StreamingQueryException}} cause.
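
The behavior named in the summary (fail subsequent writes after a prior failure, and keep the original error) can be illustrated with a minimal Java sketch. This is not Spark's code: {{FailFastAsyncLog}}, {{Write}}, {{submit}}, and {{shutdownAndGetPersisted}} are hypothetical names, and the sketch only assumes a single-threaded executor as described above. Each task checks for a stored error on the executor thread before writing, so tasks queued before the failure are skipped as well, and the first error is recorded with a compare-and-set so later failures cannot replace it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; not Spark's AsyncOffsetSeqLog / AsyncCommitLog. */
final class FailFastAsyncLog {
    /** Stand-in for one log-file write to durable storage (assumed shape). */
    interface Write { void run() throws Exception; }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // First error wins: compareAndSet(null, e) never replaces a stored error.
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();
    private final List<Long> persisted = new CopyOnWriteArrayList<>();

    void submit(long batchId, Write write) {
        executor.execute(() -> {
            // Checked on the executor thread, so tasks that were already
            // queued when an earlier write failed are skipped too.
            if (firstError.get() != null) {
                return;
            }
            try {
                write.run();
                persisted.add(batchId);
            } catch (Exception e) {
                firstError.compareAndSet(null, e);
            }
        });
    }

    /** Drains the queue, then returns the batch ids that were written. */
    List<Long> shutdownAndGetPersisted() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("executor did not terminate");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(persisted);
    }

    Throwable firstError() {
        return firstError.get();
    }

    public static void main(String[] args) {
        FailFastAsyncLog log = new FailFastAsyncLog();
        log.submit(1, () -> { });
        log.submit(2, () -> { throw new IOException("Permission denied"); });
        log.submit(3, () -> { });  // would succeed, but must not be persisted
        log.submit(4, () -> { throw new IllegalStateException("cascading failure"); });
        System.out.println(log.shutdownAndGetPersisted());  // prints [1]
        System.out.println(log.firstError().getMessage());  // prints Permission denied
    }
}
```

In this sketch batch 3 is never written even though its write would have succeeded, so no gap appears after the failed batch 2, and the reported error is the original {{Permission denied}} rather than the later failure. Whether the actual fix gates writes inside the task, cancels queued tasks, or shuts the executor down is not stated here; the in-task check is just one way to cover tasks that were already queued.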