Vinod,

One more observation: every time the NM or RM gets killed, I see the following kind of messages in the NM's log:

2014-03-05 05:33:23,824 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,
2014-03-05 05:33:23,824 DEBUG org.apache.hadoop.ipc.Client: IPC Client (2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir sending #5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client (2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir got value #5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: nodeHeartbeat took 2ms

Does that give any clue? Could something be going wrong while it is getting the node's health status?

Thanks,
Kishore

On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli <vino...@apache.org> wrote:

> I remember you asking this question before. Check if your OS' OOM killer
> is killing it.
>
> +Vinod
>
> On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <
> write2kish...@gmail.com> wrote:
>
> Hi,
> I am running an application on a 2-node cluster, which tries to acquire
> all the containers that are available on one of those nodes and the remaining
> containers from the other node in the cluster. When I run this application
> continuously in a loop, either the NM or the RM gets killed at a random
> point. There is no corresponding message in the log files.
>
> On one of the occasions the NM got killed today, the tail of its log
> looked like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> And at the time of the NM's crash, the RM's log has the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server: got #151
>
>
> Note: the node on which the NM got killed is isredeng. Does anything in
> the above messages indicate why it got killed?
>
> Thanks,
> Kishore
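
For reference, the OOM-killer check suggested in the quoted reply could be sketched roughly as follows. This is only an illustration: the function name `find_oom_kills` is hypothetical, and the message patterns are an assumption about how the Linux kernel typically words its OOM-killer lines (the wording varies between kernel versions). The NM and RM run as JVMs, so the sketch looks for victims named `java` by default.

```python
import re

# Assumed wording of the lines the Linux kernel logs when its OOM killer
# picks a victim; the exact text differs between kernel versions.
OOM_PATTERNS = [
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
    re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_kills(kernel_log_text, process_name="java"):
    """Return (pid, line) pairs for OOM-killer lines that name process_name."""
    hits = []
    for line in kernel_log_text.splitlines():
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if match and match.group("name") == process_name:
                hits.append((int(match.group("pid")), line.strip()))
                break  # count each log line at most once
    return hits
```

The input would be the kernel log text, for example the output of `dmesg` or the contents of the system log file (its location depends on the distribution). A reported pid that matches the pid of the NM or RM that disappeared would point to the OOM killer; an empty result would suggest looking elsewhere.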