[ https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988940#comment-13988940 ]
Nan Zhu commented on SPARK-1175:
--------------------------------

This should have been fixed in https://github.com/apache/spark/commit/f99af8529b6969986f0c3e03f6ff9b7bb9d53ece

> on shutting down a long running job, the cluster does not accept new jobs and
> gets hung
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-1175
>                 URL: https://issues.apache.org/jira/browse/SPARK-1175
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>   Affects Versions: 0.8.1, 0.9.0
>            Reporter: Tal Sliwowicz
>            Assignee: Nan Zhu
>              Labels: shutdown, worker
>             Fix For: 1.0.0
>
>
> When shutting down a long processing job (24+ hours) that runs periodically
> on the same context and generates a lot of shuffles (many hundreds of GB) the
> spark workers get hung for a long while and the cluster does not accept new
> jobs. The only way to proceed is to kill -9 the workers.
> This is a big problem because when multiple contexts run on the same cluster,
> one must stop them all for a simple restart.
> The context is stopped using sc.stop()
> This happens both in standalone mode and under mesos.
> We suspect this is caused by the "delete Spark local dirs" thread. Attached a
> thread dump of the worker.
> Also, the relevant part may be:
>
> "SIGTERM handler" - Thread t@41040
>    java.lang.Thread.State: BLOCKED
>         at java.lang.Shutdown.exit(Shutdown.java:168)
>         - waiting to lock <69eab6a3> (a java.lang.Class) owned by "SIGTERM handler" t@41038
>         at java.lang.Terminator$1.handle(Terminator.java:35)
>         at sun.misc.Signal$1.run(Signal.java:195)
>         at java.lang.Thread.run(Thread.java:662)
>    Locked ownable synchronizers:
>         - None
>
> "delete Spark local dirs" - Thread t@40
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.delete0(Native Method)
>         at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
>         at java.io.File.delete(File.java:904)
>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
>         at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
>         at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
>         at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
>         at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
>         at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
>         at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>    Locked ownable synchronizers:
>         - None
>
> "SIGTERM handler" - Thread t@41038
>    java.lang.Thread.State: WAITING
>         at java.lang.Object.wait(Native Method)
>         - waiting on <355c6c8d> (a org.apache.spark.storage.DiskBlockManager$$anon$1)
>         at java.lang.Thread.join(Thread.java:1186)
>         at java.lang.Thread.join(Thread.java:1239)
>         at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
>         at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
>         at java.lang.Shutdown.runHooks(Shutdown.java:79)
>         at java.lang.Shutdown.sequence(Shutdown.java:123)
>         at java.lang.Shutdown.exit(Shutdown.java:168)
>         - locked <69eab6a3> (a java.lang.Class)
>         at java.lang.Terminator$1.handle(Terminator.java:35)
>         at sun.misc.Signal$1.run(Signal.java:195)
>         at java.lang.Thread.run(Thread.java:662)
>    Locked ownable synchronizers:
>         - None

--
This message was sent by Atlassian JIRA
(v6.2#6252)
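
For readers of the dump above: the first SIGTERM handler (t@41038) sits in Thread.join inside ApplicationShutdownHooks.runHooks, waiting for the "delete Spark local dirs" hook thread, which is still busy in deleteRecursively. The second SIGTERM handler (t@41040) is blocked on the lock the first one holds. A minimal Java sketch of that start-then-join pattern follows. It is an illustration only, not Spark's code and not the fix in the commit linked above. The class name, the method names, and the sleep that stands in for a slow recursive delete are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownHookSketch {

    // Simulates the pattern seen in the dump at ApplicationShutdownHooks.runHooks:
    // start every hook thread, then join each one. The caller cannot proceed
    // until the slowest hook has finished. Returns the time spent, in ms.
    static long runHooksMillis(List<Thread> hooks) throws InterruptedException {
        long start = System.nanoTime();
        for (Thread hook : hooks) {
            hook.start();
        }
        for (Thread hook : hooks) {
            hook.join(); // unbounded wait, like the WAITING SIGTERM handler
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Hypothetical stand-in for a hook that deletes a very large directory
    // tree; the sleep models the time spent in the recursive delete.
    static Thread slowCleanupHook(long millis) {
        return new Thread(() -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "delete local dirs (simulated)");
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> hooks = new ArrayList<>();
        hooks.add(slowCleanupHook(300));
        long elapsed = runHooksMillis(hooks);
        System.out.println("shutdown sequence blocked for about " + elapsed + " ms");
    }
}
```

In a real JVM the hook would be registered with Runtime.getRuntime().addShutdownHook(...), and the process would not exit on SIGTERM until it returned. With hundreds of GB of shuffle files to delete, that wait can be long, which would match the report that only kill -9 gets past it.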