[ https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reopened SPARK-16230:
----------------------------------

> Executors self-killing after being assigned tasks while still in init
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16230
>                 URL: https://issues.apache.org/jira/browse/SPARK-16230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
>
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR: [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends a request to register itself with the driver.
> * DRIVER: Registers the executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179].
> * EXECUTOR: The ExecutorBackend receives the ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81].
> * DRIVER: Tries to launch a task since it knows there is a new executor, and sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
> * EXECUTOR: The Executor is not yet initialized (one reason I have seen: it was still registering with the local external shuffle service). Meanwhile, it receives the `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not initialized (see the sketch below).
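> To make the window concrete, here is a minimal, self-contained model of the race. The names are illustrative only, not the actual Spark classes (the real handlers live in CoarseGrainedExecutorBackend.receive); this is a sketch of the shape of the bug, not the Spark source:
> {code:scala}
> object InitRaceSketch {
>   sealed trait Msg
>   case object RegisteredExecutor extends Msg
>   case object LaunchTask extends Msg
>
>   class Backend {
>     // Stays null until registration handling finishes, like the
>     // `executor` field in CoarseGrainedExecutorBackend.
>     @volatile private var executor: AnyRef = null
>
>     def receive(msg: Msg): Unit = msg match {
>       case RegisteredExecutor =>
>         Thread.sleep(100) // simulate slow init, e.g. shuffle-service registration
>         executor = new Object
>       case LaunchTask =>
>         if (executor == null) {
>           // The self-kill path: in Spark this logs an error and exits.
>           println("Received LaunchTask command but executor was null -> exit(1)")
>         } else {
>           println("task launched")
>         }
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     val backend = new Backend
>     // The "driver" registers the executor and immediately launches a task,
>     // racing against the backend's own init.
>     val init = new Thread(() => backend.receive(RegisteredExecutor))
>     init.start()
>     backend.receive(LaunchTask)
>     init.join()
>   }
> }
> {code}
> Running this almost always takes the "executor was null" branch, which is the same condition the linked code exits on.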
> The driver assumes that an Executor is ready to accept tasks as soon as it is registered, but that is not true.
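> For context, the driver-side ordering looks roughly like this (a simplified, hypothetical sketch of the RegisterExecutor handling in CoarseGrainedSchedulerBackend, not the exact source): the reply and the scheduling offer happen back-to-back, with nothing in between that waits for the executor's init to finish.
> {code:scala}
> object DriverSideSketch {
>   // Illustrative stand-ins for the RPC reply and the scheduling offer.
>   def replyRegistered(): Unit = println("-> RegisteredExecutor")
>   def makeOffers(): Unit = println("-> LaunchTask (executor may still be init'ing)")
>
>   def onRegisterExecutor(): Unit = {
>     replyRegistered() // the executor is now marked registered...
>     makeOffers()      // ...and is immediately offered tasks
>   }
>
>   def main(args: Array[String]): Unit = onRegisterExecutor()
> }
> {code}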
> How this affects jobs / the cluster:
> * We waste time and resources on these executors, but they never do any meaningful computation.
> * The driver thinks the executor has started running the task, but since the Executor has killed itself, it never tells the driver (BTW: that is another issue which I think could be fixed separately). The driver waits for 10 mins and then declares the executor dead, which adds to the latency of the job. Plus, the failure-attempt count gets bumped for these tasks even though they never actually started. For unlucky tasks, this might cause job failure, as the example below illustrates.
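> As a hedged illustration of that last point (a mitigation, not a fix for the race): each phantom failure counts against spark.task.maxFailures, which defaults to 4, so a heavily affected job can be made more tolerant by raising that limit. The value 8 below is an arbitrary example:
> {code:scala}
> import org.apache.spark.SparkConf
>
> object MaxFailuresExample {
>   // Each self-killed executor can add a failure attempt to a task that never
>   // actually ran; once a task reaches spark.task.maxFailures, the job fails.
>   val conf = new SparkConf()
>     .setAppName("example")
>     .set("spark.task.maxFailures", "8") // arbitrarily higher than the default 4
> }
> {code}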


