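I remember you asking this question before. Check whether your OS's OOM killer is killing it. The OOM killer terminates a process with SIGKILL, so the NM or RM would get no chance to log anything, which matches what you are seeing.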
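One way to check is to scan the kernel log for OOM-killer entries. The following is a minimal sketch, not a definitive tool: the function name `find_oom_events` is made up for illustration, and the patterns are assumptions based on the messages Linux kernels typically print ("invoked oom-killer", "Out of memory: Kill process ...", "Killed process ..."). The exact wording varies between kernel versions, so adjust the patterns if nothing matches.

```python
import re

# Assumed patterns, modeled on typical Linux kernel OOM messages.
# The exact text differs between kernel versions.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
    re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_events(log_lines):
    """Return (line, pid, name) tuples for lines that look like OOM-killer activity.

    pid and name are None when the matching pattern does not capture them.
    """
    events = []
    for line in log_lines:
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if match:
                groups = match.groupdict()
                events.append((line.rstrip("\n"), groups.get("pid"), groups.get("name")))
                break  # one event per line is enough
    return events
```

To use it, pass in the lines of `dmesg` output or of the system log file (its location depends on the distribution) and look for entries whose process name is `java` with a timestamp close to the time the NM or RM disappeared.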
+Vinod

On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> wrote:

> Hi,
> I am running an application on a 2-node cluster, which tries to acquire all
> the containers that are available on one of those nodes and remaining
> containers from the other node in the cluster. When I run this application
> continuously in a loop, one of the NM or RM is getting killed at a random
> point. There is no corresponding message in the log files.
>
> One of the times that NM had got killed today, the tail of the it's log is
> like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> And at the time of NM's crash, the RM's log has the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher:
> Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server: got #151
>
>
> Note: the name of the node on which NM has got killed is isredeng, does it
> indicate anything from the above message as to why it got killed?
>
> Thanks,
> Kishore