Shixiong Zhu created SPARK-27348:
------------------------------------

             Summary: HeartbeatReceiver doesn't remove lost executors from 
CoarseGrainedSchedulerBackend
                 Key: SPARK-27348
                 URL: https://issues.apache.org/jira/browse/SPARK-27348
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Shixiong Zhu


When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove the 
lost executor from CoarseGrainedSchedulerBackend. When a connection is 
gracefully shut down, CoarseGrainedSchedulerBackend will not receive a 
disconnect event. In this case, CoarseGrainedSchedulerBackend still thinks 
the lost executor is alive and may ask TaskScheduler to run tasks on it. 
Such tasks will never finish, and the job will hang forever.
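
The failure mode described above can be illustrated with a small toy model. 
This is only a sketch under stated assumptions: SchedulerBackend, 
HeartbeatMonitor, simulate and the notify_backend flag are hypothetical names 
made up for illustration, and none of this is Spark's actual code. It shows 
that when the heartbeat side drops an executor without telling the backend, 
the backend keeps offering that executor for new tasks.

```python
# Toy model of the reported behavior. These classes are hypothetical
# stand-ins and do not reproduce Spark's actual implementation.


class SchedulerBackend:
    """Tracks the executors the backend believes are alive."""

    def __init__(self):
        self.executors = set()

    def register(self, executor_id):
        self.executors.add(executor_id)

    def remove(self, executor_id):
        self.executors.discard(executor_id)

    def offers(self):
        # Executors the backend would offer to the task scheduler.
        return sorted(self.executors)


class HeartbeatMonitor:
    """Expires executors whose last heartbeat is older than the timeout."""

    def __init__(self, backend, timeout, notify_backend):
        self.backend = backend
        self.timeout = timeout
        self.notify_backend = notify_backend  # assumption: models a possible remedy
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire(self, now):
        lost = [e for e, t in self.last_seen.items() if now - t > self.timeout]
        for executor_id in lost:
            del self.last_seen[executor_id]
            if self.notify_backend:
                # Without this call the backend never learns about the loss.
                self.backend.remove(executor_id)
        return lost


def simulate(notify_backend):
    backend = SchedulerBackend()
    monitor = HeartbeatMonitor(backend, timeout=10, notify_backend=notify_backend)
    for executor_id in ("exec-1", "exec-2"):
        backend.register(executor_id)
        monitor.heartbeat(executor_id, now=0)
    monitor.heartbeat("exec-1", now=8)  # exec-2 goes silent after time 0
    lost = monitor.expire(now=12)
    return lost, backend.offers()


if __name__ == "__main__":
    print(simulate(notify_backend=False))  # backend still offers exec-2
    print(simulate(notify_backend=True))   # backend no longer offers exec-2
```

With notify_backend=False the monitor reports exec-2 as lost, yet the backend 
still lists it among the executors it would offer, which corresponds to the 
hang described in this issue. Having the monitor notify the backend is one 
conceivable remedy; whether an actual fix takes this shape is not stated here.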



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
