[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490309#comment-16490309 ]
Rui Li commented on SPARK-24387:
--------------------------------

When HeartbeatReceiver finds that an executor's heartbeat has timed out, it informs the TaskScheduler and kills the executor asynchronously. When the TaskScheduler handles the lost executor, it asks the backend to revive offers. So I think there is a race condition: the backend may make offers before the executor is killed. Since this is the only executor left, it is offered to the TaskScheduler and the retried task is scheduled on it.

Also, when a heartbeat-timeout executor is killed, we expect a replacement executor to be launched. But by the time the new executor is launched, there is no task left for it to run, so it stays idle until dynamic allocation kills it.

> Heartbeat-timeout executor is added back and used again
> -------------------------------------------------------
>
>                 Key: SPARK-24387
>                 URL: https://issues.apache.org/jira/browse/SPARK-24387
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Rui Li
>            Priority: Major
>
> In our job, when there's only one task and one executor running, the
> executor's heartbeat is lost and the driver decides to remove it. However,
> the executor is added again and the task's retry attempt is scheduled to
> that executor, almost immediately after the executor is marked as lost.
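
To make the suspected ordering in the comment concrete, here is a minimal toy model of it. It is only a sketch: `ToyBackend`, `run`, and the step names are hypothetical and do not mirror Spark's actual classes or method signatures. The two asynchronous steps (reviving offers and killing the executor) are replayed in a fixed order, so the result is deterministic.

```python
# Toy model only: all names here are hypothetical, not Spark's real API.
class ToyBackend:
    def __init__(self, executors):
        # Executors the backend still considers registered and offerable.
        self.registered = set(executors)

    def make_offers(self):
        # Offer every executor that has not been removed yet.
        return sorted(self.registered)

    def kill_executor(self, executor_id):
        # Remove the executor so it is no longer offered.
        self.registered.discard(executor_id)


def run(order):
    """Replay the two steps in the given order and return the executor
    the retried task lands on, or None if no executor was offered."""
    backend = ToyBackend(["exec-1"])  # the only executor; its heartbeat timed out
    assigned = None
    for step in order:
        if step == "revive_offers":
            offers = backend.make_offers()
            if offers and assigned is None:
                assigned = offers[0]
        elif step == "kill_executor":
            backend.kill_executor("exec-1")
    return assigned


# Offers are made before the kill takes effect: the retry goes to the lost executor.
racy = run(["revive_offers", "kill_executor"])
# The kill takes effect first: nothing is offered until a replacement registers.
safe = run(["kill_executor", "revive_offers"])
print(racy, safe)
```

In the first ordering the retried task is placed on the timed-out executor, which matches the behavior reported in the issue. In the second it waits for a replacement executor.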