[ 
https://issues.apache.org/jira/browse/SPARK-57191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-57191.
------------------------------------
    Fix Version/s: 4.2.0
         Assignee: Shrirang Mhalgi
       Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/56274

> [YARN] Driver hangs indefinitely when job submission / monitor thread fails
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-57191
>                 URL: https://issues.apache.org/jira/browse/SPARK-57191
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 4.1.2
>            Reporter: Rohan Arora
>            Assignee: Shrirang Mhalgi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> h4. *Overview*
> In Spark-on-YARN client-mode deployments, if a fatal uncaught exception is 
> thrown within the asynchronous application-submission or 
> application-monitoring thread (e.g., during initialisation inside 
> {{YarnClientSchedulerBackend}} or YARN {{{}Client.scala{}}}), the Spark 
> driver process hangs indefinitely instead of shutting down or propagating 
> the exception to the main thread.
> h4. *Root Cause Analysis*
>  # {*}Asynchronous Execution{*}: When YARN client mode starts, 
> {{YarnClientSchedulerBackend}} submits the Spark application context to YARN 
> and monitors it asynchronously (e.g., utilising the internal 
> {{MonitorThread}} or Scala standard {{Future}} contexts).
>  # {*}Exception Swallowing/Isolation{*}: If a fatal exception occurs in these 
> background threads (such as network failure, credential expiration, or 
> {{OutOfMemoryError}} during the initial handshake), the exception is either 
> swallowed by Scala {{Future}} execution pools or trapped in a thread that 
> is not guarded by Spark's custom {{{}SparkUncaughtExceptionHandler{}}}.
>  # {*}Blocker Threads Inactive{*}: Main threads (like the one executing 
> {{SparkContext.init}} or {{{}waitForApplication{}}}) remain blocked 
> indefinitely, waiting on a future completion or lock notification that 
> never arrives.
>  # {*}Zombie JVM State{*}: Since the driver process has already spun up 
> active non-daemon threads (such as heartbeats, Spark UI HTTP server, and log 
> appenders), the JVM does not exit naturally, leaving the driver in a 
> zombie/hung state.
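
The first three steps above can be reproduced outside Spark with plain JVM primitives. The following is a minimal Java sketch, not Spark code: `SwallowedFailureDemo`, `submitApplication`, and the thread name are hypothetical stand-ins. It shows that a task passed to `ExecutorService.submit` has its exception captured in the returned `Future`, so the thread's uncaught exception handler is never invoked and a caller waiting on a separate signal is never released. A bounded wait is used here only so the demo terminates; the scenario in the report corresponds to an unbounded wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class SwallowedFailureDemo {

    /** What the waiting side observed after one run. */
    static final class Outcome {
        final boolean handlerInvoked;  // did the uncaught exception handler fire?
        final boolean waiterReleased;  // did the waiting thread get its signal?

        Outcome(boolean handlerInvoked, boolean waiterReleased) {
            this.handlerInvoked = handlerInvoked;
            this.waiterReleased = waiterReleased;
        }
    }

    // Hypothetical stand-in for the asynchronous submission step.
    private static void submitApplication(CountDownLatch appReady, boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated submission failure");
        }
        appReady.countDown();  // only reached on success
    }

    static Outcome run(boolean fail, long waitMillis) {
        AtomicBoolean handlerInvoked = new AtomicBoolean(false);
        CountDownLatch appReady = new CountDownLatch(1);

        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hypothetical-submit-thread");
            t.setDaemon(true);
            // This handler would fire for an exception escaping Thread.run(),
            // but submit() wraps the task so the exception never escapes.
            t.setUncaughtExceptionHandler((thread, e) -> handlerInvoked.set(true));
            return t;
        });

        boolean released = false;
        try {
            // The returned Future holds the exception; nobody calls get() on it.
            pool.submit(() -> submitApplication(appReady, fail));
            // Stand-in for the main thread waiting on a signal.
            released = appReady.await(waitMillis, TimeUnit.MILLISECONDS);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return new Outcome(handlerInvoked.get(), released);
    }

    public static void main(String[] args) {
        Outcome o = run(true, 200);
        System.out.println("handlerInvoked=" + o.handlerInvoked
                + ", waiterReleased=" + o.waiterReleased);
    }
}
```

With `fail` set to `true`, neither the handler nor the waiter sees the failure. With an unbounded `await()` the calling thread would stay blocked, which is the shape of the hang described in the report.
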
> h4. *Impact on Managed Environments*
> In orchestration and managed environments (such as cloud platform agents, 
> workflows, schedulers), the agent continues to report the job driver process 
> as active. The scheduler cannot distinguish this hung driver from a driver 
> performing legitimate post-execution cleanup (like metastore synchronization 
> or final file renaming). This leads to resource leaks, orphaned driver 
> processes, and long waits for customers before jobs time out.
> h4. *Proposed Solution*
>  * {*}Exception Propagation{*}: Ensure that worker thread closures and 
> background futures executing YARN submissions are wrapped in robust 
> {{try-catch}} blocks that propagate exceptions to Spark's uncaught exception 
> handler ({{{}ThreadUtils.runInNewThread{}}} should be leveraged for thread 
> instantiation).
>  * {*}Explicit Teardown on Failure{*}: On critical failures inside the 
> submission or monitoring loops, explicitly trigger {{SparkContext.stop()}} or 
> standard JVM termination ({{{}System.exit(exitCode){}}}) so that the main 
> thread does not block indefinitely on states that will never resolve.
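
The exception-propagation idea in the first bullet can be sketched in plain Java as follows. This illustrates the general pattern only (run the body in a separate thread, record any failure, rethrow it in the caller after `join()`); it is not the change made in the linked pull request and not Spark's `ThreadUtils` implementation. `PropagatingRunner` and its method signature are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class PropagatingRunner {

    /**
     * Runs body in a new daemon thread, waits for it to finish, and rethrows
     * any failure in the calling thread instead of leaving it in the worker.
     */
    static <T> T runInNewThread(String name, Supplier<T> body) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                result.set(body.get());
            } catch (Throwable e) {
                // Record the failure so the caller can observe it after join().
                failure.set(e);
            }
        }, name);
        worker.setDaemon(true);  // a stuck worker must not keep the JVM alive
        worker.start();

        try {
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + name, ie);
        }

        Throwable e = failure.get();
        if (e != null) {
            // Surface the worker's failure where the caller can act on it.
            throw new RuntimeException("failure in thread " + name, e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        try {
            runInNewThread("hypothetical-submit-thread", () -> {
                throw new IllegalStateException("simulated submission failure");
            });
        } catch (RuntimeException ex) {
            System.out.println("caller saw: " + ex.getCause().getMessage());
        }
    }
}
```

Because the failure reaches the calling thread, the caller can apply the second bullet (stopping the context or exiting with a non-zero code) instead of waiting on a state that will never change.
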



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
