[jira] [Commented] (SPARK-1175) on shutting down a long running job, the cluster does not accept new jobs and gets hung

Nan Zhu (JIRA) Sun, 04 May 2014 01:13:17 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988940#comment-13988940
 ]


Nan Zhu commented on SPARK-1175:
--------------------------------

This should have been fixed in 
https://github.com/apache/spark/commit/f99af8529b6969986f0c3e03f6ff9b7bb9d53ece

> on shutting down a long running job, the cluster does not accept new jobs and 
> gets hung
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-1175
>                 URL: https://issues.apache.org/jira/browse/SPARK-1175
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Tal Sliwowicz
>            Assignee: Nan Zhu
>              Labels: shutdown, worker
>             Fix For: 1.0.0
>
>
> When shutting down a long-running processing job (24+ hours) that runs 
> periodically on the same context and generates a lot of shuffle data (many 
> hundreds of GB), the Spark workers hang for a long while and the cluster does 
> not accept new jobs. The only way to proceed is to kill -9 the workers.
> This is a big problem because when multiple contexts run on the same cluster, 
> one must stop them all for a simple restart.
> The context is stopped using sc.stop().
> This happens both in standalone mode and under Mesos.
> We suspect this is caused by the "delete Spark local dirs" thread. Attached is 
> a thread dump of the worker. The relevant part may be:
> "SIGTERM handler" - Thread t@41040
>    java.lang.Thread.State: BLOCKED
>       at java.lang.Shutdown.exit(Shutdown.java:168)
>       - waiting to lock <69eab6a3> (a java.lang.Class) owned by "SIGTERM handler" t@41038
>       at java.lang.Terminator$1.handle(Terminator.java:35)
>       at sun.misc.Signal$1.run(Signal.java:195)
>       at java.lang.Thread.run(Thread.java:662)
>    Locked ownable synchronizers:
>       - None
> "delete Spark local dirs" - Thread t@40
>    java.lang.Thread.State: RUNNABLE
>       at java.io.UnixFileSystem.delete0(Native Method)
>       at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
>       at java.io.File.delete(File.java:904)
>       at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
>       at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
>       at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
>       at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>       at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>       at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
>       at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
>       at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
>       at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>       at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>       at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
>       at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
>       at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
>       at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>       at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>       at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>    Locked ownable synchronizers:
>       - None
> "SIGTERM handler" - Thread t@41038
>    java.lang.Thread.State: WAITING
>       at java.lang.Object.wait(Native Method)
>       - waiting on <355c6c8d> (a org.apache.spark.storage.DiskBlockManager$$anon$1)
>       at java.lang.Thread.join(Thread.java:1186)
>       at java.lang.Thread.join(Thread.java:1239)
>       at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
>       at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
>       at java.lang.Shutdown.runHooks(Shutdown.java:79)
>       at java.lang.Shutdown.sequence(Shutdown.java:123)
>       at java.lang.Shutdown.exit(Shutdown.java:168)
>       - locked <69eab6a3> (a java.lang.Class)
>       at java.lang.Terminator$1.handle(Terminator.java:35)
>       at sun.misc.Signal$1.run(Signal.java:195)
>       at java.lang.Thread.run(Thread.java:662)
>    Locked ownable synchronizers:
>       - None
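
For readers following the dump above, here is a rough Java sketch of the thread relationship it shows: one "SIGTERM handler" (t@41038) holds the Shutdown lock and sits in Thread.join waiting for the "delete Spark local dirs" hook, while a second handler (t@41040) is BLOCKED on that same lock. This is only a toy model, not Spark or JDK code. The class name, the thread names, and the latch that stands in for the long recursive delete are all assumptions made for illustration, and no real JVM shutdown is triggered.

```java
import java.util.concurrent.CountDownLatch;

/** Toy model of the three threads in the dump; every name here is hypothetical. */
class ShutdownHangSketch {

    /** Poll until the thread reaches the wanted state (bounded to about 5 seconds). */
    private static void awaitState(Thread t, Thread.State wanted) throws InterruptedException {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != wanted && System.nanoTime() < deadline) {
            Thread.sleep(5);
        }
    }

    /** Returns the observed states of { first handler, second handler }. */
    static String[] observeStates() {
        final Object shutdownLock = new Object();                  // stands in for the Shutdown class lock
        final CountDownLatch deletionDone = new CountDownLatch(1); // stands in for the slow recursive delete
        final CountDownLatch lockHeld = new CountDownLatch(1);

        // Models "delete Spark local dirs": stays alive until the "deletion" finishes.
        // (In the real dump this thread is RUNNABLE doing file I/O; here it just waits.)
        final Thread cleanupHook = new Thread(() -> {
            try {
                deletionDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "cleanup-hook");

        // Models the first handler: takes the lock, then joins the hook thread.
        Thread first = new Thread(() -> {
            synchronized (shutdownLock) {
                lockHeld.countDown();
                try {
                    cleanupHook.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "handler-1");

        // Models the second handler: needs the same lock and cannot get it.
        Thread second = new Thread(() -> {
            synchronized (shutdownLock) {
                // nothing to do; acquiring the lock is the point
            }
        }, "handler-2");

        try {
            cleanupHook.start();
            first.start();
            lockHeld.await();          // make sure handler-1 owns the lock first
            second.start();
            awaitState(first, Thread.State.WAITING);
            awaitState(second, Thread.State.BLOCKED);
            String[] states = { first.getState().toString(), second.getState().toString() };

            deletionDone.countDown();  // "deletion" finishes, so everything unwinds
            first.join();
            second.join();
            cleanupHook.join();
            return states;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] s = observeStates();
        System.out.println("first handler: " + s[0] + ", second handler: " + s[1]);
    }
}
```

In this model nothing can proceed until the cleanup thread finishes, which is consistent with the reporter's observation that the worker stays hung for as long as the directory deletion runs.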



--
This message was sent by Atlassian JIRA
(v6.2#6252)
