Rohan Arora created SPARK-57191:
-----------------------------------

             Summary: [YARN] Driver hangs indefinitely when job submission / 
monitor thread fails
                 Key: SPARK-57191
                 URL: https://issues.apache.org/jira/browse/SPARK-57191
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 4.1.2
            Reporter: Rohan Arora


h4. *Overview*

In Spark-on-YARN client-mode deployments, if a fatal uncaught exception is 
thrown in the asynchronous application-submission or application-monitoring 
thread (e.g., during initialisation inside {{YarnClientSchedulerBackend}} or 
YARN {{Client.scala}}), the Spark driver process hangs indefinitely instead 
of shutting down or propagating the exception to the main thread.

h4. *Root Cause Analysis*

 # *Asynchronous Execution*: When YARN client mode starts, 
{{YarnClientSchedulerBackend}} submits the Spark application context to YARN 
and monitors it asynchronously (e.g., using the internal {{MonitorThread}} 
or Scala standard {{Future}} contexts).
 # *Exception Swallowing/Isolation*: If a fatal exception occurs in these 
background threads (such as a network failure, credential expiration, or an 
{{OutOfMemoryError}} during the initial handshake), the exception is either 
swallowed by the Scala {{Future}} execution pool or trapped in a thread 
that is not guarded by Spark's custom {{SparkUncaughtExceptionHandler}}.
 # *Main Thread Blocked*: Main threads (like the one executing 
{{SparkContext.init}} or {{waitForApplication}}) remain blocked indefinitely, 
waiting on a future completion or lock notification that never arrives.
 # *Zombie JVM State*: Because the driver process has already started 
non-daemon threads (such as heartbeats, the Spark UI HTTP server, and log 
appenders), the JVM does not exit naturally, leaving the driver in a 
zombie/hung state.
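
The chain above can be reproduced in miniature with plain {{java.util.concurrent}} primitives. The following is a minimal sketch, not Spark code: the class, the {{submitApplication}} helper, and the {{applicationReady}} latch are hypothetical stand-ins for the submission call and for whatever signal the main thread waits on. It shows a task that fails inside an executor, an uncaught exception handler that is never invoked because the executor captures the failure in the returned {{Future}}, and a waiter that stays blocked. The wait is bounded here only so that the demo terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class SwallowedFailureDemo {

    // Hypothetical stand-in for a submission call that fails fatally.
    static void submitApplication() {
        throw new IllegalStateException("simulated submission failure");
    }

    /**
     * Returns true if the waiter was still blocked after timeoutMs and the
     * thread's uncaught exception handler never saw the failure.
     */
    static boolean waiterStaysBlocked(long timeoutMs) {
        CountDownLatch applicationReady = new CountDownLatch(1);
        AtomicBoolean handlerInvoked = new AtomicBoolean(false);

        // The worker inherits non-daemon status when created from a non-daemon
        // thread such as main, so it would keep the JVM alive until the pool
        // is shut down.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "submit-monitor");
            t.setUncaughtExceptionHandler((thread, error) -> handlerInvoked.set(true));
            return t;
        });
        try {
            // submit() stores the exception in the returned Future, which
            // nobody inspects here, so the handler above is bypassed.
            pool.submit(() -> {
                submitApplication();
                applicationReady.countDown(); // never reached
            });
            // A real driver would wait without a timeout and hang at this point.
            boolean released = applicationReady.await(timeoutMs, TimeUnit.MILLISECONDS);
            return !released && !handlerInvoked.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // the demo cleans up; the hung driver never does
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter still blocked: " + waiterStaysBlocked(500));
    }
}
```

Removing the {{shutdownNow()}} call and the timeout turns the sketch into the hang itself: the waiter never returns and the idle non-daemon worker keeps the process alive.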

h4. *Impact on Managed Environments*

In orchestration and managed environments (such as cloud platform agents, 
workflows, and schedulers), the agent continues to report the job driver 
process as active. The scheduler cannot distinguish this hung driver from a 
driver performing legitimate post-execution cleanup (like metastore 
synchronization or final file renaming). This leads to resource leaks, 
orphaned driver processes, and long job timeout durations for customers.

h4. *Proposed Solution*

 * *Exception Propagation*: Ensure that worker thread closures and 
background futures executing YARN submissions are wrapped in 
{{try-catch}} blocks that propagate exceptions to Spark's uncaught exception 
handler ({{ThreadUtils.runInNewThread}} should be used for thread 
creation).
 * *Explicit Teardown on Failure*: On critical failures inside the 
submission or monitoring loops, explicitly trigger {{SparkContext.stop()}} or 
standard JVM termination ({{System.exit(exitCode)}}) so that the main 
thread does not block indefinitely on states that will never resolve.
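
One possible shape for both proposals is sketched below in plain Java. It is an illustration under stated assumptions, not Spark's implementation and not the behaviour of {{ThreadUtils.runInNewThread}}: the class and the helpers {{runGuarded}} and {{awaitOrTearDown}} are hypothetical names, and the bounded wait is an extra safeguard that the proposal above does not itself require. The background body runs on a daemon thread, any {{Throwable}} is routed into a {{CompletableFuture}} so the waiter wakes up, and the waiter runs a teardown callback before rethrowing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class PropagatingMonitorSketch {

    /** Runs body on a daemon thread and routes any Throwable into the future. */
    static <T> CompletableFuture<T> runGuarded(String threadName, Supplier<T> body) {
        CompletableFuture<T> result = new CompletableFuture<>();
        Thread t = new Thread(() -> {
            try {
                result.complete(body.get());
            } catch (Throwable e) {
                // Fail the future so that waiters wake up instead of blocking forever.
                result.completeExceptionally(e);
            }
        }, threadName);
        t.setDaemon(true); // a stuck helper thread must not keep the JVM alive
        t.start();
        return result;
    }

    /** Waits for the result; on failure or timeout runs tearDown, then rethrows. */
    static <T> T awaitOrTearDown(CompletableFuture<T> future, long timeoutMs, Runnable tearDown) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            tearDown.run();
            throw new IllegalStateException("background task failed", e.getCause());
        } catch (TimeoutException e) {
            tearDown.run();
            throw new IllegalStateException("no result within " + timeoutMs + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            tearDown.run();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean stopped = new AtomicBoolean(false);
        // Simulated failure inside the submission thread.
        CompletableFuture<String> state = runGuarded("yarn-submit", () -> {
            throw new IllegalStateException("simulated credential expiry");
        });
        try {
            awaitOrTearDown(state, 5_000, () -> stopped.set(true));
        } catch (IllegalStateException e) {
            System.out.println("propagated: " + e.getCause().getMessage()
                    + ", teardown ran: " + stopped.get());
        }
    }
}
```

In a driver, the teardown callback would be the place to call {{SparkContext.stop()}} or {{System.exit(exitCode)}}, as proposed above.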



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
