[ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deng An updated SPARK-41483:
----------------------------
    Description: 
My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).

In scenarios where the shutdown hook does not finish (e.g., it times out), the 
application is never unregistered, so the YARN ResourceManager resubmits the 
application even though it actually succeeded.
{code:java}
22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}
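As an aside: the 10s per-hook limit seen above is the Hadoop ShutdownHookManager 
default. On Hadoop versions that include HADOOP-15679 it is configurable; the key 
name below is taken from that ticket and is an assumption that should be verified 
against the Hadoop version actually deployed. A minimal sketch of forwarding it 
through Spark (this only buys time, it does not fix the blocking report()):
{code:java}
import org.apache.spark.SparkConf

// Assumption: hadoop.service.shutdown.timeout (HADOOP-15679) governs the
// per-hook timeout that expires at 09:28:16 and 09:28:26 in the log above.
// spark.hadoop.* keys are copied into the Hadoop Configurations that Spark
// creates; depending on the Hadoop version, core-site.xml may be the more
// reliable place to set this.
val conf = new SparkConf()
  .set("spark.hadoop.hadoop.service.shutdown.timeout", "60s")
{code}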
From the log, the SparkContext shutdown hook appears to hang after the web UI is 
stopped. Eventually, the Hadoop ShutdownHookManager throws a TimeoutException and 
shuts the JVM down forcefully.

As a result, YARN marks the Spark application as FAILED, because the 
ApplicationMaster's unregister call is never executed.

Reviewing the code in SparkContext#stop(): right after the web UI is stopped, the 
next step is metricsSystem#report(). That call can block for a long time for 
various reasons (such as a network timeout), and this is the root cause of the 
shutdown hook timeout.

In our case, the network was unstable for a period of time, so the metrics sinks 
took a long time to fail with a connection timeout; as a direct consequence, 
SparkContext failed to stop within the 10s budget. The relevant excerpt from 
SparkContext#stop(), followed by a possible mitigation sketch:
{code:java}
// SparkContext#stop(): the web UI is stopped first ...
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
// ... then the metrics system flushes all sinks; a sink stuck on an
// unreachable endpoint can block here well past the shutdown hook timeout
if (env != null) {
  Utils.tryLogNonFatalError {
    env.metricsSystem.report()
  }
} {code}
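A minimal sketch of one possible direction, assuming a hypothetical helper 
(runWithDeadline is not existing Spark API, and the 5s budget is an arbitrary 
choice): run each cleanup step with its own hard deadline so a single slow step 
cannot consume the entire shutdown budget.
{code:java}
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

// Hypothetical helper, not existing Spark API: run a cleanup step on a
// separate thread and give up (with an interrupt) once the deadline passes.
def runWithDeadline(name: String, timeoutSec: Long)(body: => Unit): Unit = {
  val pool = Executors.newSingleThreadExecutor()
  val task = pool.submit(new Runnable { override def run(): Unit = body })
  try {
    task.get(timeoutSec, TimeUnit.SECONDS)
  } catch {
    case _: TimeoutException =>
      task.cancel(true) // interrupt a sink blocked on the network
      println(s"$name did not finish within ${timeoutSec}s, skipping it")
  } finally {
    pool.shutdownNow()
  }
}

// e.g., inside SparkContext#stop():
// runWithDeadline("metricsSystem.report()", 5) { env.metricsSystem.report() }
{code}
One caveat with this approach: the interrupt only helps if the sink's blocking 
call is interruptible; a sink stuck in a non-interruptible socket write would 
still leak a thread, although the shutdown hook itself would no longer hang.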
 

We use Spark 2.x and are not sure whether Spark 3.x has the same problem. Could 
someone please take a look?



> MetricsSystem report may take too long, causing the Spark application to be 
> marked FAILED on YARN
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Priority: Major


