[ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Deng An updated SPARK-41483:
----------------------------
Description: 

My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]). In the scenario where the shutdown hook does not finish running (e.g., it times out), the application is not unregistered, and the YARN RM resubmits the application even though it succeeded.

{code:java}
22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}

From the log, it appears that the SparkContext shutdown hook hangs after the web UI is closed. Eventually the Hadoop ShutdownHookManager throws a timeout exception and shuts down forcefully. As a result, YARN marks the Spark application as FAILED, because the unregister call in the ApplicationMaster is never executed.
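For context, the stack trace above suggests how the timeout is enforced. The following is a simplified reconstruction based only on that trace (an assumption, not the actual org.apache.hadoop.util.ShutdownHookManager source): each hook runs as a FutureTask that is abandoned once its time budget expires, so anything the hook would have done after the hung call (such as the ApplicationMaster's unregister) never runs.

{code:java}
import java.util.concurrent.{FutureTask, TimeUnit, TimeoutException}

// Run one shutdown hook with a time budget; if it overruns, give up and move on.
def runHookWithTimeout(hook: Runnable, timeoutSecs: Long): Unit = {
  val task = new FutureTask[Unit](hook, ())
  new Thread(task, "shutdown-hook").start()
  try {
    task.get(timeoutSecs, TimeUnit.SECONDS) // bounded wait for the hook to finish
  } catch {
    case _: TimeoutException =>
      // This is where "ShutdownHook '...' timeout" is logged; the hung
      // thread is simply left behind, and the rest of the hook never runs.
      ()
  }
}
{code}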
Reviewing the code in SparkContext#stop(): after the web UI is stopped, the next step is metricsSystem#report(). This call can block for a long time for various reasons (such as a network timeout), which is the root cause of the shutdown hook timing out. In our case, the network was unstable for a period of time, so the sinks took a long time to throw a connection timeout exception, which in turn prevented the SparkContext from stopping within 10s.

{code:java}
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
if (env != null) {
  Utils.tryLogNonFatalError {
    env.metricsSystem.report()
  }
}
{code}
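One possible direction would be to bound the final metrics flush so that a hung sink cannot consume the whole shutdown-hook budget. The following is a minimal, untested sketch (the helper name and the 2-second budget are illustrative assumptions, not an actual Spark patch):

{code:java}
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

// Run the final metrics flush on a separate thread and give up after a
// bounded wait, so a sink stuck on a dead network cannot stall shutdown.
def reportWithTimeout(report: () => Unit, timeoutSecs: Long): Unit = {
  val executor = Executors.newSingleThreadExecutor()
  val task = executor.submit(new Runnable {
    override def run(): Unit = report()
  })
  try {
    task.get(timeoutSecs, TimeUnit.SECONDS) // bounded wait for the flush
  } catch {
    case _: TimeoutException =>
      task.cancel(true) // interrupt the hung flush instead of blocking shutdown
  } finally {
    executor.shutdownNow()
  }
}

// Hypothetical use inside SparkContext#stop():
//   Utils.tryLogNonFatalError {
//     reportWithTimeout(() => env.metricsSystem.report(), 2)
//   }
{code}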
We use Spark 2.x and are not sure whether Spark 3.x has a similar problem. Could someone please review it?


> MetricsSystem report may cost too much time, which can lead to the Spark
> application failing on YARN
> -------------------------------------------------------------------------
>
> Key: SPARK-41483
> URL: https://issues.apache.org/jira/browse/SPARK-41483
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.8
> Reporter: Deng An
> Priority: Major