Nan Zhu created SPARK-21197:
-------------------------------

             Summary: Tricky use case keeps a dead application running for a long time
                 Key: SPARK-21197
                 URL: https://issues.apache.org/jira/browse/SPARK-21197
             Project: Spark
          Issue Type: Bug
          Components: DStreams, Spark Core
    Affects Versions: 2.1.1, 2.0.2
            Reporter: Nan Zhu


The use case is in Spark Streaming while the root cause is in DAGScheduler, so I set the components to both DStreams and Spark Core.

Use case: 

The user has a thread that periodically triggers Spark jobs, and in the same application they also retrieve data from an external source through Spark Streaming. In the Streaming logic an exception is thrown, so the whole application is supposed to shut down and be restarted by YARN.

The user observed that even after the exception had propagated to Spark Core and SparkContext.stop() had been called, the application was still running 18 hours later.
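
For concreteness, a minimal sketch of the reported scenario, assuming local mode; the names (HangRepro, the 100 ms period) are made up for illustration, and whether this actually reproduces the hang depends on timing:

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro shape, not the user's actual application.
object HangRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("SPARK-21197-repro"))

    // Background thread that keeps submitting jobs, i.e. keeps feeding
    // DAGScheduler's event queue even while the context is being stopped.
    val exec = Executors.newSingleThreadScheduledExecutor()
    exec.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = sc.parallelize(1 to 100).count()
    }, 0L, 100L, TimeUnit.MILLISECONDS)

    Thread.sleep(2000)
    // Stands in for the Streaming failure path calling SparkContext.stop():
    // per this report, stop() can block here while jobs keep arriving.
    sc.stop()
    exec.shutdownNow()
  }
}
{code}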

The root cause is that when DAGScheduler.stop() is called, we wait for the event loop's thread to finish (https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1704 and https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L40).

Since there is a thread that keeps pushing events into DAGScheduler's event queue, the event thread never finishes, and the wait never returns.
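
To illustrate the failure mode with a simplified, standalone sketch (this is not Spark's actual EventLoop code; all names are made up): a stop() that joins a thread which only exits once the queue is drained can block forever when a producer keeps the queue non-empty.

{code:scala}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Simplified drain-then-join event loop, for illustration only.
class DrainingEventLoop[E <: AnyRef](name: String) {
  private val eventQueue = new LinkedBlockingQueue[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    override def run(): Unit = {
      // Exits only when stop() has been called AND the queue is empty; a
      // producer that never pauses keeps the second condition false forever.
      while (!stopped.get || !eventQueue.isEmpty) {
        val event = eventQueue.poll(100, TimeUnit.MILLISECONDS)
        if (event != null) onReceive(event)
      }
    }
  }

  def onReceive(event: E): Unit = {}
  def post(event: E): Unit = eventQueue.put(event)
  def start(): Unit = eventThread.start()

  def stop(): Unit = {
    stopped.set(true)
    eventThread.join() // the caller (cf. DAGScheduler.stop()) blocks here
  }
}
{code}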

A potential solution is that EventLoop should allow interrupting its thread directly in some cases, e.g. this one, while still allowing a graceful shutdown in other cases, e.g. the ListenerBus one.
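
A hypothetical sketch of what that could look like; the graceful flag and the interrupt handling are illustrative choices, not an existing Spark API:

{code:scala}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative event loop whose stop() supports both shutdown modes.
class StoppableEventLoop[E <: AnyRef](name: String) {
  private val eventQueue = new LinkedBlockingQueue[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    override def run(): Unit = {
      try {
        while (!stopped.get || !eventQueue.isEmpty) {
          val event = eventQueue.poll(100, TimeUnit.MILLISECONDS)
          if (event != null) onReceive(event)
        }
      } catch {
        case _: InterruptedException => // forced stop: exit without draining
      }
    }
  }

  def onReceive(event: E): Unit = {}
  def post(event: E): Unit = eventQueue.put(event)
  def start(): Unit = eventThread.start()

  def stop(graceful: Boolean): Unit = {
    stopped.set(true)
    if (!graceful) {
      // e.g. DAGScheduler shutdown: abandon queued events and exit promptly.
      eventThread.interrupt()
    }
    // Graceful callers (e.g. a ListenerBus-style loop whose producers have
    // already stopped) simply drain what is left before this join returns.
    eventThread.join()
  }
}
{code}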




