Hi,

I am running an application on a 2-node cluster. It tries to acquire all the containers available on one of the nodes and the remaining containers from the other node. When I run this application continuously in a loop, either the NM or the RM gets killed at a random point. There is no corresponding message in the log files.

One of the times the NM got killed today, the tail of its log looked like this:

2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,

At the time of the NM's crash, the RM's log has the following entries:

2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server Responder: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: nodeUpdate: isredeng:52867 clusterResources: <memory:16384, vCores:16>
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Node being looked for scheduling isredeng:52867 availableResource: <memory:0, vCores:-8>
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server: got #151

Note: the node on which the NM got killed is named isredeng. Do the messages above indicate anything about why it got killed?

Thanks,
Kishore