Yuchen Liu created SPARK-56720:
----------------------------------

             Summary: Fail subsequent async log writes after a prior failure in 
async progress tracking
                 Key: SPARK-56720
                 URL: https://issues.apache.org/jira/browse/SPARK-56720
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.4.0
            Reporter: Yuchen Liu


When async progress tracking is enabled, offset and commit log writes are 
submitted to a single-threaded executor in {{AsyncOffsetSeqLog}} / 
{{AsyncCommitLog}}. If one async write task fails (e.g. an HDFS 
{{Permission denied}} or other {{IOException}}), follow-up tasks that are 
already queued, or that are queued before the main thread re-checks 
{{errorNotifier}} at the next batch boundary, still execute and may 
successfully persist files to durable storage. This produces two 
correctness/observability problems:
 # Gaps on durable storage. The offset log may be missing batch _N_ while batch 
_N+1_ is present, or a commit-log entry can be written without its 
corresponding offset-log entry. This violates the invariant that the commit log 
is a prefix of the offset log on disk.
 # Root cause is masked. {{ErrorNotifier.markError}} overwrites previously 
stored errors, so a later cascading failure (e.g. 
{{concurrentStreamLogUpdate}}) can replace the original 
{{Permission denied}}/{{IOException}} and surface as the user-visible 
{{StreamingQueryException}} cause.
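
The behavior named in the summary (fail later writes once one has failed, and keep the original error) can be sketched roughly as below. This is a standalone illustration and not Spark's code: {{FailFastAsyncLog}}, {{LogWrite}}, {{submit}} and {{firstError}} are hypothetical names, and the real {{AsyncOffsetSeqLog}} / {{AsyncCommitLog}} / {{ErrorNotifier}} classes may be structured quite differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch only; it does not mirror Spark's actual classes.
final class FailFastAsyncLog {
    /** Stand-in for a single offset-log or commit-log write that may throw. */
    interface LogWrite {
        void run() throws Exception;
    }

    // Holds the first failure only; later failures never replace it.
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();
    // Single writer thread, so queued writes run strictly in submission order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    CompletableFuture<Void> submit(long batchId, LogWrite write) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        writer.execute(() -> {
            // Check at execution time, not submission time, so that tasks
            // queued before the failure are also skipped.
            Throwable prior = firstError.get();
            if (prior != null) {
                result.completeExceptionally(new IllegalStateException(
                    "Skipping write for batch " + batchId
                        + " after an earlier async write failed", prior));
                return;
            }
            try {
                write.run();
                result.complete(null);
            } catch (Throwable t) {
                // First error wins: the CAS succeeds only while nothing is stored.
                firstError.compareAndSet(null, t);
                result.completeExceptionally(t);
            }
        });
        return result;
    }

    Throwable firstError() {
        return firstError.get();
    }

    void shutdown() {
        writer.shutdown();
    }
}
```

In this sketch, if the write for batch 1 throws an {{IOException}}, a write for batch 2 that was already queued completes exceptionally without touching storage, so no gap of the kind described in the first item is produced, and {{firstError()}} still returns the original {{IOException}} rather than a later cascading failure.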



