[ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490309#comment-16490309
 ] 

Rui Li commented on SPARK-24387:
--------------------------------

When HeartbeatReceiver finds that an executor's heartbeat has timed out, it 
informs the TaskScheduler and kills the executor asynchronously. When the 
TaskScheduler handles the lost executor, it asks the backend to revive offers. 
So I think there's a race condition: the backend may make offers before the 
executor is actually killed. Since this is the only executor left, it is 
offered to the TaskScheduler and the retried task is scheduled on it.

When a heartbeat-timeout executor is killed, we expect a replacement executor 
to be launched. But by the time the new executor is launched, there is no task 
left for it to run, so it stays idle until dynamic allocation kills it.
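
To make the suspected ordering concrete, here is a minimal single-threaded 
sketch of the two interleavings. It is not Spark code: Backend, simulate, 
revive_offers and the executor/task ids are hypothetical names, and the only 
thing it models is whether the kill or the offer happens first.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical stand-in for the scheduler backend: tracks live executors."""
    executors: set = field(default_factory=set)

    def make_offers(self):
        # Offer every executor the backend still believes is alive.
        return sorted(self.executors)

    def kill(self, executor_id):
        self.executors.discard(executor_id)

    def launch(self, executor_id):
        self.executors.add(executor_id)


def simulate(kill_before_offer):
    """Return (task -> executor assignments, idle executors) for one ordering."""
    backend = Backend({"exec-1"})
    pending = ["task-0.retry"]  # retry of the task that ran on exec-1
    assignments = {}

    def revive_offers():
        # Assign pending tasks to whatever executors the backend offers.
        for executor_id in backend.make_offers():
            if pending:
                assignments[pending.pop(0)] = executor_id

    # Heartbeat timeout on exec-1 triggers two steps whose relative order
    # is not guaranteed because the kill is asynchronous.
    steps = [lambda: backend.kill("exec-1"), revive_offers]
    if not kill_before_offer:
        steps.reverse()
    for step in steps:
        step()

    # Later, the replacement executor comes up and offers are revived again.
    backend.launch("exec-2")
    revive_offers()
    busy = set(assignments.values())
    idle = [e for e in backend.make_offers() if e not in busy]
    return assignments, idle


print(simulate(kill_before_offer=True))
print(simulate(kill_before_offer=False))
```

With kill_before_offer=False the retry lands on exec-1 and exec-2 is left with 
nothing to run, which matches the behaviour reported here. With 
kill_before_offer=True the retry waits for exec-2 instead.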

> Heartbeat-timeout executor is added back and used again
> -------------------------------------------------------
>
>                 Key: SPARK-24387
>                 URL: https://issues.apache.org/jira/browse/SPARK-24387
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Rui Li
>            Priority: Major
>
> In our job, when there's only one task and one executor running, the 
> executor's heartbeat is lost and the driver decides to remove it. However, 
> the executor is added again and the task's retry attempt is scheduled on 
> that executor, almost immediately after the executor is marked as lost.



