[ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hieu Tri Huynh updated SPARK-24981:
-----------------------------------
    Priority: Minor  (was: Major)

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24981
>                 URL: https://issues.apache.org/jira/browse/SPARK-24981
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Hieu Tri Huynh
>            Priority: Minor
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager stops it instead. However, each shutdown hook is only 
> given 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job fails even though it had already succeeded. The resulting failure is 
> shown in the log at the end of this description.
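> For reference, the failure does not occur when the program stops the context 
> itself. A minimal sketch of that pattern (the app name and the RDD work are 
> placeholders, not part of this report):
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> object ExplicitStopApp {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("ExplicitStopApp"))
>     try {
>       // Placeholder for the real job.
>       val total = sc.parallelize(1 to 1000).sum()
>       println(s"total = $total")
>     } finally {
>       // stop() runs on the user thread here, with no 10s shutdown-hook cap,
>       // so the shutdown hook later finds the context already stopped.
>       sc.stop()
>     }
>   }
> }
> {code}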
> I think there are a few ways to fix this; below are the two that I have now:
> 1. After the user program finishes, we can check whether it stopped the 
> SparkContext. If it didn't, we can stop the context before finishing the 
> userThread. That way, SparkContext.stop() can take as much time as it needs 
> (see the first sketch after this list).
> 2. We can catch the InterruptedException thrown by ShutdownHookManager while 
> we are stopping the SparkContext, and ignore whatever we haven't stopped 
> inside the context yet. Since we are shutting down anyway, I think it is okay 
> to ignore those things (see the second sketch after this list).
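> A sketch of option 1. This would live inside Spark core, on the thread that 
> ran the user's main (SparkContext.getActive and isStopped are Spark-internal 
> accessors, and the method name here is made up for illustration):
> {code:scala}
> // Hypothetical hook point inside Spark core, invoked on the user thread
> // right after the user's main() returns and before JVM shutdown hooks run.
> private def stopContextIfUserDidNot(): Unit = {
>   SparkContext.getActive.foreach { sc =>
>     if (!sc.isStopped) {
>       // No shutdown-hook timeout applies on this thread, so stop() can
>       // take as long as it needs.
>       sc.stop()
>     }
>   }
> }
> {code}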
>  
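> And a sketch of option 2, following the stop() path in the stack trace below 
> (illustrative shape only, not the actual SparkContext code):
> {code:scala}
> // Inside the shutdown-hook branch of SparkContext.stop(); illustrative only.
> try {
>   listenerBus.stop()
> } catch {
>   case _: InterruptedException =>
>     // We are already shutting down, so abandon whatever has not been
>     // stopped yet instead of failing a job that already succeeded.
>     Thread.currentThread().interrupt()
> }
> {code}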
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Thread.join(Thread.java:1249)
>       at java.lang.Thread.join(Thread.java:1323)
>       at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>       at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>       at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>       at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>       at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>       at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>       at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>       at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>       at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>       at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>       at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>       at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>       at scala.util.Try$.apply(Try.scala:192)
>       at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>       at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>       at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>       at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> {code}


