Hi,

I am running a Spark application on YARN in cluster mode.
One of my executors appears to be in a hung state for a long time, and is
finally killed by the driver.

Unlike the other executors, it has not received a StopExecutor message
from the driver.

Here are the logs at the end of this container (C_1):
--------------------------------------------------------------------------------
15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
broadcast 36, response is 2
15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to
Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
TMO-GCR70/192.168.162.70:9000 from admin: closed
15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with
renew id 1 executed
15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with
renew id 1 expired
15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with
renew id 1 exited
15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL 15: SIGTERM

NOTE that it produced no logs for more than 2 hours before the SIGTERM.

Here are the logs at the end of a normal container (C_2):
------------------------------------------------------------------------------------
15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to
Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
:37906/user/CoarseGrainedScheduler#160899257]
15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
commanded a shutdown
15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] *handled
message (181.499835 ms) StopExecutor* from
Actor[akka.tcp://sparkDriver@TMO-DN73
:37906/user/CoarseGrainedScheduler#160899257]
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remote daemon shut down; proceeding with flushing remote transports.
15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remoting shut down.
15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more
references remain: org.apache.hadoop.ipc.Client@76a68bd4
15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called

On the driver side, I can see logs for heartbeat messages from
C_1 until 20:05:00:
------------------------------------------------------------------------------------------
15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]

After this, the driver continues to receive heartbeats from the other
executors, but not from this one. Here is the message responsible for its SIGTERM:

------------------------------------------------------------------------------------------------------------

15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
80515ms exceeds 45000ms
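
In case it matters, the 45000ms above seems to be the block manager slave
timeout. Here is a minimal sketch of how I could raise it while debugging
(my assumption: spark.storage.blockManagerSlaveTimeoutMs is still the right
knob in 1.2.1, and the app name below is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the block manager timeout (assumed default 45000 ms)
// and set the executor-to-driver heartbeat interval explicitly while
// investigating the hang.
val conf = new SparkConf()
  .setAppName("my-app")  // placeholder
  .set("spark.storage.blockManagerSlaveTimeoutMs", "600000")
  .set("spark.executor.heartbeatInterval", "10000")
val sc = new SparkContext(conf)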


I am using Spark 1.2.1.

Any pointers?


Thanks,

Twinkle
