This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new be441e8  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

be441e8 is described below

commit be441e84069acc711ea848c69ae5bd55a7c93531
Author: yi.wu <yi...@databricks.com>
AuthorDate: Wed Jan 5 10:48:16 2022 +0800

    [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

    ### What changes were proposed in this pull request?

    This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of a thread belonging to `RpcEnv`.

    ### Why are the changes needed?

    `System.exit` triggers the shutdown hook to run `executor.stop`, which results in the same deadlock issue as SPARK-14180. Note that since Spark recently upgraded to Hadoop 3, each hook now has a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) that forcibly interrupts the hook's execution once the timeout is reached. [...]

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Tested manually.

    Closes #35069 from Ngone51/fix-workerwatcher-exit.
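The deadlock described above boils down to a task occupying a pool's only thread while waiting on follow-up work that can only run on that same thread. A minimal, self-contained sketch of those mechanics (illustrative names only, not Spark code — a plain single-threaded executor stands in for the RpcEnv dispatcher, and the queued task stands in for the shutdown hook's `executor.stop`):

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

// Illustrative only: a single-threaded pool standing in for an RpcEnv
// dispatcher thread that calls System.exit directly.
object DeadlockSketch {
  def runSketch(): String = {
    val pool = Executors.newSingleThreadExecutor()

    // The outer task occupies the pool's only thread (like an RpcEnv
    // thread initiating shutdown).
    val outer = pool.submit(new Callable[String] {
      override def call(): String = {
        // "Shutdown hook" work queued on the same pool; it can never
        // start because the only thread is busy running this task.
        val hook = pool.submit(new Runnable {
          override def run(): Unit = ()
        })
        try {
          hook.get(500, TimeUnit.MILLISECONDS)
          "hook completed"
        } catch {
          case _: TimeoutException => "deadlock: hook never ran"
        }
      }
    })

    val result = outer.get()
    pool.shutdownNow()
    result
  }

  def main(args: Array[String]): Unit = println(runSketch())
}
```

This is why the fix below starts a brand-new named thread for `System.exit`: a thread outside any RpcEnv pool is never the one the shutdown hooks are waiting on, so the exit sequence can proceed.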
Authored-by: yi.wu <yi...@databricks.com>
Signed-off-by: yi.wu <yi...@databricks.com>
(cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366)
Signed-off-by: yi.wu <yi...@databricks.com>
---
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
index efffc9f..b7a5728 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
     if (isTesting) {
       isShutDown = true
     } else if (isChildProcessStopping.compareAndSet(false, true)) {
-      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
-      System.exit(-1)
+      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
+      // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
+      // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
+      new Thread("WorkerWatcher-exit-executor") {
+        override def run(): Unit = System.exit(-1)
+      }.start()
     }

   override def receive: PartialFunction[Any, Unit] = {

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org