[ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deng An updated SPARK-41483:
----------------------------
    Description: 
My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).

In the scenario where the shutdown hook does not run to completion (e.g., because it times out), the application is never unregistered, and the YARN RM resubmits the application even though it succeeded.
{code}
22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}
From the log, it appears that the SparkContext shutdown hook hangs after the web UI is stopped. Eventually, the Hadoop ShutdownHookManager throws a TimeoutException and shuts down forcefully.
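
The 10-second gaps in the log match Hadoop's per-hook timeout: each registered hook runs in a FutureTask, and ShutdownHookManager waits on FutureTask.get() (the frame visible in the stack traces above) for a bounded time before giving up. A rough sketch of the mechanism, assuming a Hadoop version that exposes the timeout overload of addShutdownHook:
{code:scala}
import java.util.concurrent.TimeUnit
import org.apache.hadoop.util.ShutdownHookManager

// Illustrative only (not Spark's registration code): a hook that blocks
// longer than its timeout is abandoned with a TimeoutException, exactly
// as in the WARN lines above.
object HookTimeoutSketch {
  def main(args: Array[String]): Unit = {
    ShutdownHookManager.get().addShutdownHook(
      new Runnable {
        // Stands in for a SparkContext.stop() that blocks on a slow sink.
        override def run(): Unit = Thread.sleep(60 * 1000)
      },
      50,               // priority: higher-priority hooks run first
      10,               // per-hook timeout...
      TimeUnit.SECONDS  // ...after which the hook is abandoned
    )
  }
}
{code}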

As a result, YARN marked the Spark application as FAILED, because the unregister call in the ApplicationMaster was never executed.
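
For context, YARN learns the final status only when the ApplicationMaster unregisters itself from the ResourceManager; if the JVM dies before that call, the RM treats the attempt as failed. A minimal sketch of the underlying Hadoop client call (illustrative only; Spark wraps this in its own YARN client code):
{code:scala}
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient

// Illustrative only: the call that tells the RM this attempt is done.
// If the shutdown hook is killed before this runs, the RM never sees
// SUCCEEDED and may schedule another application attempt.
object UnregisterSketch {
  def unregister(amClient: AMRMClient[_]): Unit = {
    amClient.unregisterApplicationMaster(
      FinalApplicationStatus.SUCCEEDED, // final status reported to the RM
      "",                               // diagnostics message
      "")                               // tracking / history URL
  }
}
{code}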

 

Looking at SparkContext#stop(), the step that follows stopping the web UI is env.metricsSystem.report(). That call can block for a long time for various reasons (such as a network timeout), which is the root cause of the shutdown hook timing out.

In our case, the network was unstable for a period of time, so the metrics sinks took a long time before throwing a connection timeout exception, and the SparkContext could not finish stopping within the 10-second hook timeout.
{code:scala}
// Excerpt from SparkContext.stop(): the web UI is stopped first, then the
// metrics system flushes a final report to every configured sink.
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
if (env != null) {
  Utils.tryLogNonFatalError {
    env.metricsSystem.report()
  }
}
{code}
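
One possible direction for the improvement, purely as a sketch: run the report on a daemon thread and bound the wait, so a hanging sink cannot consume the entire shutdown budget. This assumes it is acceptable to drop the final metrics report when a sink hangs; reportWithTimeout and reportFn below are hypothetical names, with reportFn standing in for env.metricsSystem.report().
{code:scala}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}

// Hypothetical mitigation sketch (not an actual Spark patch): bound a
// potentially blocking call so it cannot eat the shutdown hook's budget.
object BoundedReport {
  def reportWithTimeout(reportFn: () => Unit, timeoutSec: Long): Unit = {
    val executor = Executors.newSingleThreadExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "metrics-report")
        t.setDaemon(true) // do not keep the JVM alive if the report hangs
        t
      }
    })
    val future = executor.submit(new Runnable {
      override def run(): Unit = reportFn()
    })
    try {
      future.get(timeoutSec, TimeUnit.SECONDS)
    } catch {
      case _: TimeoutException =>
        future.cancel(true) // give up; shutdown (and AM unregister) can proceed
    } finally {
      executor.shutdownNow()
    }
  }
}
{code}
With something like this, stop() would call reportWithTimeout(() => env.metricsSystem.report(), 5) instead of invoking report() directly, leaving the remaining shutdown budget for the unregister step.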


> MetricsSystem report takes too much time, which may cause the Spark
> application to fail on YARN
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Priority: Major
>


