Hi,

The operations are not very extensive, and this scenario is not always reproducible. One of the executors starts behaving in this manner. For this particular application, we are using 8 cores per executor, and in practice, 4 executors are launched on one machine.
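For reference, the relevant part of the configuration looks roughly like the sketch below (the app name and memory value are illustrative placeholders, not our actual settings):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the executor layout described above: 8 cores per executor,
    // which in practice packs 4 executors onto one machine.
    val conf = new SparkConf()
      .setAppName("example-app")             // illustrative name
      .set("spark.executor.cores", "8")      // 8 cores per executor
      .set("spark.executor.memory", "16g")   // illustrative value
    val sc = new SparkContext(conf)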
This machine has a good configuration with respect to the number of cores. It seems to me to be some Akka communication issue: if I try to take a thread dump of the executor once it appears to be in trouble, the request times out. Could it be something related to *spark.akka.threads*? (A sketch of a possible timeout workaround follows the quoted thread below.)

On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Mostly, that particular executor is stuck on a GC pause. What operation
> are you performing? You can try increasing the parallelism if you see
> only one executor doing the task.
>
> Thanks
> Best Regards
>
> On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva
> <twinkle.sachd...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running a Spark application on YARN in cluster mode.
>> One of my executors appears to be in a hung state for a long time, and
>> finally gets killed by the driver.
>>
>> Unlike the other executors, it has not received a StopExecutor message
>> from the driver.
>>
>> Here are the logs at the end of this container (C_1):
>>
>> --------------------------------------------------------------------------------
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
>> broadcast 36, response is 2
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: closed
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
>> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 executed
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 expired
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 exited
>> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
>> SIGNAL 15: SIGTERM
>>
>> NOTE that it has no logs for more than 2 hrs.
>>
>> Here are the logs at the end of a normal container (C_2):
>>
>> ------------------------------------------------------------------------------------
>> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
>> received message StopExecutor from
>> Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
>> commanded a shutdown
>> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
>> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
>> *handled message (181.499835 ms) StopExecutor* from
>> Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Shutting down remote daemon.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remote daemon shut down; proceeding with flushing remote transports.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remoting shut down.
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
>> more references remain: org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
>> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
>> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>>
>> On the driver side, I can see logs related to heartbeat messages from
>> C_1 till 20:05:00:
>>
>> ------------------------------------------------------------------------------------------
>> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
>> Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
>> from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>>
>> After this, the driver continues to receive heartbeats from the other
>> executors, but not from this one, and here is the message responsible
>> for its SIGTERM:
>>
>> ------------------------------------------------------------------------------------------------------------
>> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
>> BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
>> 80515ms exceeds 45000ms
>>
>> I am using Spark 1.2.1.
>>
>> Any pointer(s)?
>>
>> Thanks,
>> Twinkle
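The WARN line in the quoted logs shows the driver removing the block manager because the 80515 ms heartbeat gap exceeded the 45000 ms limit. As a stopgap, raising that limit should at least keep the driver from killing a slow executor so quickly. A minimal sketch, assuming the Spark 1.2-era configuration keys (the values are illustrative, not tested):

    import org.apache.spark.SparkConf

    // Sketch only: raise the driver-side heartbeat limit behind the
    // "80515ms exceeds 45000ms" WARN above. Keys are the Spark 1.2-era
    // names; values are illustrative.
    val conf = new SparkConf()
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // removal threshold (ms)
      .set("spark.akka.timeout", "300")                          // Akka ask timeout (seconds)

Of course, this only masks the hang; if the executor is genuinely stuck (GC pause or an Akka communication issue), the underlying cause still needs to be found.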