[ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Deng An updated SPARK-41483:
----------------------------
Description:

My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).

In the scenario where the shutdown hook does not complete (e.g., it times out), the application is never unregistered, so the YARN RM resubmits the application even though it succeeded.

{code:java}
22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully.
{code}

From the log, the shutdown hook of SparkContext hangs after the web UI is closed. Eventually the Hadoop shutdown hook manager threw a TimeoutException and shut down forcefully. As a result, the Spark application was marked FAILED by YARN, because the unregister call in the ApplicationMaster was never executed.

In SparkContext#stop(), the step right after closing the web UI is metricsSystem.report(). This call can block for a long time for various reasons (such as a network timeout), which is the root cause of the shutdown hook timeout. In our case the network was unstable for a period of time, so the metrics sinks took a long time before throwing a connection timeout exception, which prevented the SparkContext from stopping within the 10s allowed per hook.

{code:scala}
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
if (env != null) {
  Utils.tryLogNonFatalError {
    env.metricsSystem.report()
  }
}
{code}
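One possible direction for the improvement is to put an upper bound on the final metrics flush, so a slow sink cannot consume the whole shutdown hook budget. Below is a minimal sketch, assuming we are free to wrap the call inside SparkContext#stop(); the helper bestEffortWithTimeout and the 2-second bound are hypothetical, not existing Spark API:

{code:scala}
import java.util.concurrent.TimeoutException

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper (not Spark API): run a best-effort cleanup step,
// but never let it block longer than `timeout`, so the later stop()
// steps (e.g. the YARN unregister) still run before the hook is killed.
def bestEffortWithTimeout(name: String, timeout: FiniteDuration)(body: => Unit): Unit =
  try {
    Await.ready(Future(body), timeout)
  } catch {
    case _: TimeoutException =>
      println(s"$name did not finish within $timeout; skipping it")
  }

// Usage at the problematic step (env.metricsSystem as in the snippet above):
// bestEffortWithTimeout("metricsSystem.report()", 2.seconds) {
//   env.metricsSystem.report()
// }
{code}

If report() never returns, the background thread is leaked, but since this runs during JVM shutdown that seems an acceptable trade-off compared to losing the ApplicationMaster unregister.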
> MetricsSystem report takes too much time, which may lead to the Spark application failing on YARN
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Priority: Major
>
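For reference, the TimeoutException stack trace in the log above comes from the way Hadoop's ShutdownHookManager bounds each registered hook: it runs the hook on an executor and waits with FutureTask.get(timeout). A minimal sketch of that pattern follows (an illustration only, not the actual Hadoop source; the 10s value matches the gaps between the WARN lines):

{code:scala}
import java.util.concurrent.{Executors, FutureTask, TimeUnit, TimeoutException}

// Illustration of the per-hook timeout pattern behind the stack trace:
// the manager waits a bounded time for each hook, then gives up and
// eventually shuts the JVM down forcefully.
val executor = Executors.newSingleThreadExecutor()
// A stand-in for a hook blocked on a slow metrics sink.
val hook = new FutureTask[Unit](() => Thread.sleep(60000))
executor.execute(hook)
try {
  hook.get(10, TimeUnit.SECONDS) // bounded wait, as in FutureTask.get(FutureTask.java:205)
} catch {
  case _: TimeoutException =>
    println("ShutdownHook timeout; remaining cleanup (e.g. AM unregister) never runs")
} finally {
  executor.shutdownNow()
}
{code}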