Hi,

The operations are not particularly heavy, and this scenario is not always
reproducible.
One of the executors starts behaving in this manner. For this particular
application, we are using 8 cores per executor, and in practice 4 executors
get launched on the same machine.

That machine has a good configuration in terms of the number of cores.
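For context, that layout comes from the submit settings, roughly as below.
Only --executor-cores 8 matches our setup; the other values are placeholders
for illustration:

    # placeholder values; only --executor-cores 8 reflects our actual setup
    spark-submit --master yarn-cluster \
      --executor-cores 8 \
      --num-executors 16 \
      --executor-memory 16g \
      --class <main-class> <app-jar>

With 8 cores per executor, YARN ends up packing about 4 executors onto each
node, given the vcores available there.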

To me it looks like some Akka communication issue. If I try to take a thread
dump of the executor once it appears to be in trouble, the request times out.
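For reference, the thread dump attempt looks roughly like this when done with
jstack on the executor's node (the PID is whatever jps reports for that
executor process):

    # find the executor JVM on the node and try to dump its threads
    jps -ml | grep CoarseGrainedExecutorBackend
    jstack <pid>        # this is what times out once the executor is in trouble
    jstack -F <pid>     # forced dump; may still work if the JVM is unresponsive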

Could it be something related to spark.akka.threads?
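If it does turn out to be an Akka dispatcher issue, I was thinking of
experimenting with something like the following in spark-defaults.conf. The
values below are guesses to try, not recommendations; as far as I know the
defaults in 1.2 are 4 threads and a 100-second timeout:

    # experimental values, not recommendations
    spark.akka.threads    8      # actor threads, default 4
    spark.akka.timeout    300    # seconds, default 100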



On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Most likely that particular executor is stuck in a GC pause. What operation
> are you performing? You can try increasing the parallelism if you see that
> only one executor is doing the task.
>
> Thanks
> Best Regards
>
> On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <
> twinkle.sachd...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running a spark application on Yarn in cluster mode.
>> One of my executors appears to be in a hung state for a long time, and
>> finally gets killed by the driver.
>>
>> Compared to the other executors, it has not received a StopExecutor message
>> from the driver.
>>
>> Here are the logs at the end of this container (C_1):
>>
>> --------------------------------------------------------------------------------
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
>> broadcast 36, response is 2
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: closed
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections
>> 0
>> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 executed
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 expired
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 exited
>> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
>> SIGNAL 15: SIGTERM
>>
>> Note that it produced no logs for more than two hours.
>>
>> Here are the logs at the end of a normal container (C_2):
>>
>> ------------------------------------------------------------------------------------
>> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
>> received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
>> :37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
>> commanded a shutdown
>> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
>> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] handled
>> message (181.499835 ms) StopExecutor from
>> Actor[akka.tcp://sparkDriver@TMO-DN73
>> :37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Shutting down remote daemon.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remote daemon shut down; proceeding with flushing remote transports.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remoting shut down.
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
>> more references remain: org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
>> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
>> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>>
>> On the driver side, I can see logs for heartbeat messages from C_1 until
>> 20:05:00:
>>
>> ------------------------------------------------------------------------------------------
>> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
>> Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
>> from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>>
>> After this, the driver continues to receive heartbeats from the other
>> executors, but not from this one, and here is the message responsible for
>> its SIGTERM:
>>
>>
>> ------------------------------------------------------------------------------------------------------------
>>
>> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
>> BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
>> 80515ms exceeds 45000ms
>>
>>
>> I am using Spark 1.2.1.
>>
>> Any pointers?
>>
>>
>> Thanks,
>>
>> Twinkle
>>
>
>
