[ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-41483:
------------------------------------

    Assignee: Apache Spark

> MetricsSystem report takes too much time, which may cause the Spark
> application to fail on YARN
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Assignee: Apache Spark
>            Priority: Major
>
> My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).
> In scenarios where the shutdown hook does not run to completion (e.g., on a
> timeout), the application is never unregistered, so the YARN RM resubmits
> the application even though it succeeded.
> {code:java}
> 22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> 22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
> 22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
> 22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> 22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> 22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully. {code}
> From the log, the SparkContext shutdown hook hangs after the web UI is
> stopped. Eventually, the Hadoop ShutdownHookManager throws a
> TimeoutException and shuts down forcefully.
> As a result, the Spark application is marked FAILED by YARN, because the
> unregister call in the ApplicationMaster was never executed.
>
> In SparkContext#stop(), the step that follows closing the web UI is
> MetricsSystem#report(). This method can block for a long time for various
> reasons (such as a network timeout), which is the root cause of the
> shutdown hook timing out.
> In our case, the network was unstable for a period of time, so the sinks
> took a long time to throw a connection timeout exception, which directly
> caused the SparkContext to fail to stop within 10s.
> {code:java}
> Utils.tryLogNonFatalError {
>   _ui.foreach(_.stop())
> }
> if (env != null) {
>   Utils.tryLogNonFatalError {
>     env.metricsSystem.report()
>   }
> } {code}
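
To illustrate the failure mode described above, here is a minimal sketch of
one way to bound a blocking call such as MetricsSystem#report() so that a
hung sink cannot hold the shutdown hook past the 10s deadline seen in the
log. This is not Spark's actual fix; runWithTimeout is a hypothetical helper,
and report() stands in for env.metricsSystem.report().

{code:scala}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}

object ReportTimeoutSketch {

  // Hypothetical stand-in for env.metricsSystem.report(): in the failure
  // mode above, it blocks until the sink's connection attempt times out.
  def report(): Unit = {
    // flush metrics to all registered sinks
  }

  // Run a blocking call on a daemon thread and abandon it after
  // `timeoutSeconds`. The abandoned thread may keep running, but because it
  // is a daemon it cannot prevent the JVM from exiting.
  def runWithTimeout(timeoutSeconds: Long)(body: => Unit): Unit = {
    val executor = Executors.newSingleThreadExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "metrics-system-report")
        t.setDaemon(true)
        t
      }
    })
    try {
      val future = executor.submit(new Runnable { override def run(): Unit = body })
      future.get(timeoutSeconds, TimeUnit.SECONDS)
    } catch {
      case _: TimeoutException =>
        // Losing one final metrics flush is preferable to YARN marking a
        // succeeded application as FAILED.
        System.err.println(s"MetricsSystem report exceeded ${timeoutSeconds}s, skipping")
    } finally {
      executor.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    // Equivalent point in SparkContext#stop():
    runWithTimeout(5)(report())
  }
}
{code}

The key design choice is the daemon thread: even if the sink never returns,
the shutdown sequence proceeds and the ApplicationMaster can still
unregister, so YARN records the final SUCCEEDED status.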