[ 
https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581383#comment-14581383
 ] 

Saisai Shao commented on SPARK-8297:
------------------------------------

Hi [~mridulm80], I tried the latest master branch with spark-shell in 
yarn-client mode. I simulated an executor crash by killing the executor with 
"kill -9". From what I observed, the scheduler backend is notified when the 
executor is lost. Here is the log:

{noformat}
scala> 15/06/11 11:08:38 ERROR cluster.YarnScheduler: Lost executor 1 on 
jerryshao-desktop: remote Rpc client disassociated
15/06/11 11:08:38 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://sparkExecutor@jerryshao-desktop:50766] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated] 
15/06/11 11:08:38 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
15/06/11 11:08:38 INFO storage.BlockManagerMasterEndpoint: Trying to remove 
executor 1 from BlockManagerMaster.
15/06/11 11:08:38 INFO storage.BlockManagerMasterEndpoint: Removing block 
manager BlockManagerId(1, jerryshao-desktop, 48633)
15/06/11 11:08:38 INFO storage.BlockManagerMaster: Removed 1 successfully in 
removeExecutor
{noformat}

This is the driver log. YarnAllocator also correctly removes the metadata in 
{{processCompletedContainers()}}. I do not fully understand what you mean; 
could you please describe the problem more specifically?


> Scheduler backend is not notified in case node fails in YARN
> ------------------------------------------------------------
>
>                 Key: SPARK-8297
>                 URL: https://issues.apache.org/jira/browse/SPARK-8297
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
>         Environment: Spark on yarn - both client and cluster mode.
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>
> When a node crashes, YARN detects the failure and notifies Spark, but this 
> information is not propagated to the scheduler backend (unlike in Mesos mode, 
> for example).
> This results in repeated re-execution of stages (due to FetchFailedException 
> on the shuffle side), and finally in application failure.




