[ https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351636#comment-15351636 ]
Tejas Patil commented on SPARK-16230:
-------------------------------------

For now, I can think of these two options (rough sketches of both follow the quoted issue description below):
* Change the scheduler to not launch tasks on an executor until it has seen the first heartbeat from that executor. OR
* Queue the `LaunchTask` events if `RegisteredExecutor` was received but the Executor is not up yet. This change would be entirely on the executor side.

> Executors self-killing after being assigned tasks while still in init
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16230
>                 URL: https://issues.apache.org/jira/browse/SPARK-16230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Tejas Patil
>            Priority: Minor
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR: [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends a request to register itself with the driver.
> * DRIVER: Registers the executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179].
> * EXECUTOR: The ExecutorBackend receives the ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81].
> * DRIVER: Tries to launch a task since it now knows about a new executor. Sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
> * EXECUTOR: The Executor is not yet initialized (one reason I have seen is that it was still trying to register with the local external shuffle service). Meanwhile, it receives the `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not initialized.
> The driver assumes that an Executor is ready to accept tasks as soon as it is registered, but that's not true.
> How this affects jobs / the cluster:
> * We waste time and resources on these executors, but they never do any meaningful computation.
> * The driver thinks that the executor has started running the task, but since the Executor has killed itself, it never tells the driver (BTW: this is another issue which I think could be fixed separately). The driver waits for 10 minutes and then declares the executor dead, which adds to the latency of the job. The failure-attempt count also gets bumped up for these tasks even though they were never started; for unlucky tasks, this might cause the job to fail.
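A minimal sketch of option 1 (driver side), assuming the makeOffers / executorDataMap structure of CoarseGrainedSchedulerBackend: only offer resources on executors that have sent at least one heartbeat. The `executorsWithHeartbeat` set and the `executorHeartbeated` hook are hypothetical names invented here for illustration; they do not exist in Spark, and the wiring from HeartbeatReceiver is not shown.

{code:scala}
// Rough sketch of option 1, NOT a patch: fields like executorDataMap,
// executorIsAlive, launchTasks and scheduler are the existing ones in
// CoarseGrainedSchedulerBackend; executorsWithHeartbeat is hypothetical.
private val executorsWithHeartbeat =
  java.util.concurrent.ConcurrentHashMap.newKeySet[String]()

// Hypothetical hook, called when the driver sees the first heartbeat from an
// executor. In real code this would have to be routed through the driver
// endpoint's message loop (e.g. as an RPC message) rather than called directly.
def executorHeartbeated(executorId: String): Unit = {
  executorsWithHeartbeat.add(executorId)
}

private def makeOffers(): Unit = {
  // Existing filter (executors being killed) plus the new heartbeat check.
  val activeExecutors = executorDataMap.filterKeys { id =>
    executorIsAlive(id) && executorsWithHeartbeat.contains(id)
  }
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toIndexedSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}
{code}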
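A minimal sketch of option 2 (executor side), assuming the receive handler shape of CoarseGrainedExecutorBackend linked above: instead of exiting when a `LaunchTask` arrives while the Executor is still null, buffer the payload and replay it once `RegisteredExecutor` has been handled. `pendingLaunches` and `deserializeAndLaunch` are names invented here; the other fields (executor, executorId, hostname, env, userClassPath, ser) are the existing ones in that class.

{code:scala}
// Rough sketch of option 2, NOT a patch: buffer LaunchTask messages that show
// up before the Executor has been constructed, and drain the buffer once it is.
import scala.collection.mutable

private val pendingLaunches = mutable.Queue.empty[SerializableBuffer]

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
    // Replay any tasks that arrived while the Executor was still initializing
    // (e.g. while it was registering with the external shuffle service).
    pendingLaunches.dequeueAll(_ => true).foreach(deserializeAndLaunch)

  case LaunchTask(data) =>
    if (executor == null) {
      // Previously this was a fatal condition (exitExecutor); hold the task instead.
      pendingLaunches.enqueue(data)
    } else {
      deserializeAndLaunch(data)
    }
}

// Hypothetical helper wrapping the existing deserialize-and-launch path.
private def deserializeAndLaunch(data: SerializableBuffer): Unit = {
  val taskDesc = ser.deserialize[TaskDescription](data.value)
  executor.launchTask(this, taskDesc.taskId, taskDesc.attemptNumber,
    taskDesc.name, taskDesc.serializedTask)
}
{code}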