[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581383#comment-14581383 ]
Saisai Shao commented on SPARK-8297:
------------------------------------

Hi [~mridulm80], I tried the latest master branch with spark-shell in yarn-client mode. I simulated an executor crash by killing it with {{kill -9}}. From my observation, the scheduler backend is notified when the executor is lost; here is the log:

{noformat}
scala> 15/06/11 11:08:38 ERROR cluster.YarnScheduler: Lost executor 1 on jerryshao-desktop: remote Rpc client disassociated
15/06/11 11:08:38 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@jerryshao-desktop:50766] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/06/11 11:08:38 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
15/06/11 11:08:38 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
15/06/11 11:08:38 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, jerryshao-desktop, 48633)
15/06/11 11:08:38 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
{noformat}

This is the driver log; {{YarnAllocator}} also correctly removes the metadata in {{processCompletedContainers()}}. I don't fully catch your meaning; could you please describe it a little more specifically? (A simplified sketch of the notification path in question follows after the quoted issue description below.)

> Scheduler backend is not notified in case node fails in YARN
> -------------------------------------------------------------
>
>                 Key: SPARK-8297
>                 URL: https://issues.apache.org/jira/browse/SPARK-8297
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
>        Environment: Spark on yarn - both client and cluster mode.
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>
> When a node crashes, YARN detects the failure and notifies Spark, but this
> information is not propagated to the scheduler backend (unlike in Mesos mode,
> for example).
> This results in repeated re-execution of stages (due to FetchFailedException
> on the shuffle side), eventually causing the application to fail.
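For reference on the notification path being discussed: the expectation in this issue is that when the YARN ResourceManager reports a container as completed or lost, the allocator should not only clean up its own metadata but also tell the driver's scheduler backend to remove the executor, so the DAGScheduler can unregister its shuffle outputs instead of discovering the loss later through FetchFailedException. Below is a simplified, self-contained sketch of that flow; the names ({{YarnAllocator}}, {{removeExecutor}}, {{processCompletedContainers}}) mirror Spark's, but this is an illustrative model, not the actual Spark source:

{code:scala}
// Simplified, self-contained model of the notification path discussed above.
// Names mirror Spark's, but this is an illustrative sketch, NOT the actual
// Spark implementation.

// A completed container as reported by the YARN ResourceManager.
case class CompletedContainer(containerId: String, exitStatus: Int)

// Stand-in for the driver-side scheduler backend: the piece that must be
// told an executor is gone so the DAGScheduler can forget its shuffle outputs.
class SchedulerBackend {
  def removeExecutor(executorId: String, reason: String): Unit =
    println(s"scheduler backend: removing executor $executorId ($reason)")
}

// Stand-in for YarnAllocator: tracks container -> executor and, crucially,
// propagates container completion to the scheduler backend.
class YarnAllocator(backend: SchedulerBackend) {
  private val containerToExecutor = scala.collection.mutable.Map[String, String]()

  def registerExecutor(containerId: String, executorId: String): Unit =
    containerToExecutor(containerId) = executorId

  def processCompletedContainers(completed: Seq[CompletedContainer]): Unit =
    for (c <- completed; executorId <- containerToExecutor.remove(c.containerId)) {
      // Cleaning up local metadata alone is not enough (the bug reported
      // here): the scheduler backend must be notified as well.
      backend.removeExecutor(
        executorId,
        s"container ${c.containerId} exited with status ${c.exitStatus}")
    }
}

object Demo extends App {
  val backend   = new SchedulerBackend
  val allocator = new YarnAllocator(backend)
  allocator.registerExecutor("container_01", "1")
  // Simulate YARN reporting the container lost, e.g. after a node failure.
  allocator.processCompletedContainers(Seq(CompletedContainer("container_01", -100)))
}
{code}

Whether the real {{YarnAllocator}} needs to make this call itself, or whether the RPC-disassociation path visible in the driver log above already covers the node-failure case, is exactly the question under discussion in this thread.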