Vinod,

  One more observation: every time the NM or RM gets killed, I see messages
like the following in the NM's log:

2014-03-05 05:33:23,824 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
health-status : true,
2014-03-05 05:33:23,824 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir sending
#5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir got
value #5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: nodeHeartbeat took 2ms


Does that give any clue? Could something be going wrong while the NM is
getting the node's health status?
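
As a side note on the OOM-killer suggestion quoted below: one way to check
would be to scan the kernel log (for example the output of dmesg) for the
lines the Linux OOM killer usually writes. Below is a minimal sketch in
Python. The function name find_oom_kills and the sample lines are
hypothetical and not taken from this cluster, and the exact kernel message
wording can differ between kernel versions, so treat the pattern as an
assumption.

```python
import re

# Assumption: the kernel logs a line of the form
# "Killed process <pid> (<name>) ..." when the OOM killer acts.
# The wording varies across kernel versions.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) pairs for OOM-kill lines in kernel log text."""
    kills = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative sample only; these lines are NOT from the cluster in this thread.
SAMPLE = [
    "Mar  5 05:33:24 host kernel: Out of memory: Kill process 4321 (java) "
    "score 900 or sacrifice child",
    "Mar  5 05:33:24 host kernel: Killed process 4321 (java) "
    "total-vm:4000000kB, anon-rss:2000000kB, file-rss:0kB",
    "Mar  5 05:33:25 host kernel: eth0: link up",
]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE):
        print(pid, name)
```

If a reported pid matches the NM or RM JVM at the time of the crash, that
would point to the OOM killer rather than to anything in the YARN logs.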

Thanks,
Kishore



On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli <vino...@apache.org
> wrote:

> I remember you asking this question before. Check if your OS' OOM killer
> is killing it.
>
> +Vinod
>
> On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <
> write2kish...@gmail.com> wrote:
>
> Hi,
>   I am running an application on a 2-node cluster, which tries to acquire
> all the containers that are available on one of those nodes and remaining
> containers from the other node in the cluster. When I run this application
> continuously in a loop, either the NM or the RM gets killed at a random
> point. There is no corresponding message in the log files.
>
> One of the times the NM got killed today, the tail of its log looked
> like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> And at the time of NM's crash, the RM's log has the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
>
>
> Note: the node on which the NM got killed is named isredeng. Do the above
> messages indicate anything about why it got killed?
>
> Thanks,
> Kishore
