Hi,

I am running a Spark application on YARN in cluster mode. One of my executors appears to hang for a long time and is eventually killed by the driver.
Unlike the other executors, it never received a StopExecutor message from the driver. Here are the logs at the end of this container (C_1):
--------------------------------------------------------------------------------
15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing broadcast 36, response is 2
15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: closed
15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed
15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 expired
15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 exited
15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

Note that it produced no logs for more than two hours before the SIGTERM.

Here are the logs at the end of a normal container (C_2):
------------------------------------------------------------------------------------
15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] handled message (181.499835 ms) StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called

On the driver side, I can see heartbeat messages from C_1 until 20:05:00:
------------------------------------------------------------------------------------------
15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106)) from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]

After this, the driver keeps receiving heartbeats from every executor except this one, and here is the message responsible for its SIGTERM:
------------------------------------------------------------------------------------------------------------
15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats: 80515ms exceeds 45000ms

I am using Spark 1.2.1. Any pointers?

Thanks,
Twinkle
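P.S. As a stopgap while I dig into why the executor stops heartbeating, I am considering raising the BlockManager timeout on the driver, roughly along these lines (just a sketch; the 300000 ms value, the app name, and the object name are arbitrary choices of mine, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the threshold after which BlockManagerMasterActor
// removes a slave (the 45000ms in the WARN above is the 1.2.x default).
object HeartbeatTimeoutSketch {  // hypothetical name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hang-debug")  // hypothetical app name
      // 300000 ms is an arbitrary value chosen for illustration.
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
      // Executor -> driver heartbeat interval (default 10000 ms in 1.2).
      .set("spark.executor.heartbeatInterval", "10000")
    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}

This would only mask the symptom, of course; I would still like to understand why C_1 went silent for two hours in the first place.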